Applied Statistics in Business & Economics, 5 edition · 2017. 4. 27. · Prepared by Lloyd R....

McGraw-Hill/Irwin Copyright © 2015 by The McGraw-Hill Companies, Inc. All rights reserved.

A PowerPoint Presentation Package to Accompany

Applied Statistics in Business & Economics, 5th edition

David P. Doane and Lori E. Seward

Prepared by Lloyd R. Jaisingh

Chapter 12: Simple Regression

Chapter Contents

12.1 Visual Displays and Correlation Analysis

12.2 Simple Regression

12.3 Regression Models

12.4 Ordinary Least Squares Formulas

12.5 Tests for Significance

12.6 Analysis of Variance: Overall Fit

12.7 Confidence and Prediction Intervals for Y


12.8 Residual Tests

12.9 Unusual Observations

12.10 Other Regression Problems (Optional)


Chapter Learning Objectives (LOs)

LO12-1: Calculate and test a correlation coefficient for significance.

LO12-2: Interpret a regression equation and use it to make predictions.

LO12-3: Explain the form and assumptions of a simple regression model.

LO12-4: Explain the least squares method, apply formulas for coefficients, and interpret R².

LO12-5: Construct confidence intervals and test hypotheses for the slope and intercept.

LO12-6: Interpret the ANOVA table and use it to compute F, R², and standard error.


LO12-7: Distinguish between confidence and prediction intervals for Y.

LO12-8: Calculate residuals and perform tests of regression assumptions.

LO12-9: Identify unusual residuals and tell when they are outliers.

LO12-10: Define leverage and identify high-leverage observations.

LO12-11: Improve data conditioning and use transformations if needed (Optional).


12.1 Visual Displays and Correlation Analysis

Visual Displays

• Begin the analysis of bivariate data (i.e., two variables) with a scatter plot.

• A scatter plot

- displays each observed data pair (xi, yi) as a dot on an X/Y grid.

- indicates visually the strength of the relationship between the two variables.


[Figure: Sample Scatter Plot]

Correlation Coefficient, r

• The sample correlation coefficient (r) measures the degree of linearity in the relationship between X and Y.

• Note: -1 ≤ r ≤ +1, and r = 0 indicates no linear relationship.
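As an illustration, r can be computed from the sums of squared deviations. The sketch below assumes the usual definition r = SSxy / √(SSxx · SSyy); the function name and the data are hypothetical and chosen only for the example.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient r for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross products and squared deviations about the means
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 4))
```

A value near +1 or -1 indicates a nearly linear pattern in the scatter plot; a value near 0 indicates little or no linear relationship.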


[Figure: Scatter Plots Showing Various Correlation Values]


Tests for Significant Correlation Using Student’s t

• Note: r is an estimate of the population correlation coefficient ρ (rho).

• Step 1: State the Hypotheses

Determine whether you are using a one- or two-tailed test and the level of significance (α).

H0: ρ = 0

H1: ρ ≠ 0

• Step 2: Specify the Decision Rule

For degrees of freedom df = n - 2, look up the critical value tα (tα/2 for a two-tailed test) in Appendix D.


Tests for Significant Correlation Using Student’s t

• Step 3: Calculate the Test Statistic

tcalc = r √((n - 2) / (1 - r²))

• Step 4: Make the Decision

Reject H0 if tcalc > tα/2 or if tcalc < -tα/2.

• Also, reject H0 if the p-value ≤ α.
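Steps 3 and 4 can be sketched in a few lines, assuming the usual statistic t = r √((n - 2) / (1 - r²)). The sample values (r = 0.60, n = 27) are hypothetical, and the critical value is typed in from a t table rather than computed.

```python
import math

def correlation_t_stat(r, n):
    """t statistic for H0: rho = 0, with df = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical sample: r = 0.60 from n = 27 pairs (df = 25)
r, n = 0.60, 27
t_calc = correlation_t_stat(r, n)

# Two-tailed critical value for alpha = 0.05 and df = 25, from a t table
t_crit = 2.060

# Two-tailed decision rule: reject H0 when |t_calc| exceeds the critical value
reject_h0 = abs(t_calc) > t_crit
print(round(t_calc, 3), reject_h0)
```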


Critical Value for Correlation Coefficient (Tests for Significance)

• Equivalently, you can calculate the critical value for the correlation coefficient using

rcrit = t / √(t² + n - 2), where t is the critical value of Student’s t for df = n - 2.

• This method gives a benchmark for the correlation coefficient.

• However, it yields no p-value and is inflexible if you change your mind about α.

• MegaStat uses this method, giving two-tail critical values for α = 0.05 and α = 0.01.
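A minimal sketch of this benchmark, assuming the form r_crit = t / √(t² + n - 2). The sample size and the tabled t value are hypothetical inputs for illustration.

```python
import math

def critical_r(t_crit, n):
    """Critical value of r implied by a critical t with df = n - 2."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

# Hypothetical setting: n = 27 pairs, two-tailed alpha = 0.05 (t from a table, df = 25)
n = 27
t_crit = 2.060
r_crit = critical_r(t_crit, n)
print(round(r_crit, 3))
```

A sample correlation whose absolute value exceeds r_crit would be judged significant at that α, which is the same decision the t test gives.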


12.2 Simple Regression

What is Simple Regression?

• Simple regression analyzes the relationship between two variables.

• It specifies one dependent (response) variable and one independent (predictor) variable.

• The hypothesized relationship here will be linear, of the form Y = slope × X + y-intercept.


LO12-2: Interpret the slope and intercept of a regression equation and use it to make predictions.

Interpreting an Estimated Regression Equation: Examples


Prediction Using Regression: Examples


12.3 Regression Models

Model and Parameters

• The assumed model for a linear relationship is y = β0 + β1x + ε.

• The relationship holds for all pairs (xi, yi).

• The error term ε is not observable and is assumed to be independently normally distributed with a mean of 0 and standard deviation σ.

• The unknown parameters are:

β0 Intercept

β1 Slope


Model and Parameters

• The fitted model (the estimated regression equation) is used to predict the expected value of Y for a given value of X:

ŷ = b0 + b1x

• The fitted coefficients are

b0 the estimated intercept

b1 the estimated slope


Fitting a Regression on a Scatter Plot

A more precise method is to let Excel calculate the estimates. We enter observations on the independent variable x1, x2, . . ., xn and the dependent variable y1, y2, . . ., yn into separate columns, and let Excel fit the regression equation, as illustrated in Figure 12.6. Excel will choose the regression coefficients so as to produce a good fit.


Slope and Intercept Interpretations

• Figure 12.6 shows a sample of miles per gallon and horsepower for 15 engines. The Excel graph and its fitted regression equation are also shown.

• Slope Interpretation: The slope of -0.0785 says that for each additional unit of engine horsepower, fuel economy decreases by 0.0785 miles per gallon. This estimated slope is a statistic because a different sample might yield a different estimate of the slope.

• Intercept Interpretation: The intercept value of 49.216 suggests that when the engine has no horsepower, the fuel efficiency would be quite high. However, the intercept has little meaning in this case, not only because zero horsepower makes no logical sense, but also because extrapolating to x = 0 is beyond the range of the observed data.
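To make the interpretation concrete, the fitted equation can be used for prediction. This sketch uses the slope and intercept reported above; the horsepower values are hypothetical, and because the observed range of the 15 engines is not given here, whether they lie inside that range is an assumption.

```python
# Fitted coefficients reported in the text for the MPG vs. horsepower sample
B0 = 49.216    # estimated intercept
B1 = -0.0785   # estimated slope (MPG per unit of horsepower)

def predict_mpg(horsepower):
    """Predicted miles per gallon from the fitted line y-hat = b0 + b1 * x."""
    return B0 + B1 * horsepower

# Hypothetical horsepower values chosen only for illustration
for hp in (100, 200):
    print(hp, round(predict_mpg(hp), 3))
```

Each additional unit of horsepower lowers the prediction by 0.0785 MPG, which is exactly what the slope interpretation states.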

12.4 Ordinary Least Squares (OLS) Formulas

Slope and Intercept

• The ordinary least squares (OLS) method estimates the slope and intercept of the regression line so that the sum of the squared residuals is minimized, which gives the best fit in the least squares sense.

• The sum of the residuals is 0.

• The sum of the squared residuals is SSE.


Slope and Intercept

• The OLS estimator for the slope is:

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

or, equivalently, b1 = SSxy / SSxx, where SSxy = Σ(xi - x̄)(yi - ȳ) and SSxx = Σ(xi - x̄)².

• The OLS estimator for the intercept is:

b0 = ȳ - b1x̄
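A minimal sketch of the OLS calculation, assuming the standard estimators b1 = SSxy / SSxx and b0 = ȳ - b1x̄. The function name and the small data set are hypothetical.

```python
def ols_fit(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = ols_fit(x, y)

# Residuals are observed minus fitted values; with OLS they sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b0, 4), round(b1, 4), round(sum(residuals), 10))
```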


*Recall from Chapter 8 that an unbiased estimator’s expected value is the true parameter and that a consistent estimator approaches ever closer to the true parameter as the sample size increases.

Assessing Fit

• We want to explain the total variation in Y around its mean (SST, the total sum of squares).

• The regression sum of squares (SSR) is the explained variation in Y.


Assessing Fit

• The error sum of squares (SSE) is the unexplained variation in Y.

• If the fit is good, SSE will be relatively small compared to SST.

• A perfect fit is indicated by SSE = 0.

• The magnitude of SSE depends on n and on the units of measurement.


Coefficient of Determination

• R² is a measure of relative fit based on a comparison of SSR and SST: R² = SSR / SST = 1 - SSE / SST.

• Often expressed as a percent, R² = 1 (i.e., 100%) indicates a perfect fit. In simple regression, R² = r².
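The three sums of squares and R² can be computed directly from a fitted line. This is a sketch under the standard definitions SST = Σ(yi - ȳ)², SSR = Σ(ŷi - ȳ)², and SSE = Σ(yi - ŷi)²; the data are hypothetical.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the OLS line fitted to x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    return sst, ssr, sse

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
```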


Standard Error of Regression

• The standard error (se) is an overall measure of model fit: se = √(SSE / (n - 2)).

• If the fitted model’s predictions are perfect (SSE = 0), then se = 0. Thus, a smaller se indicates a better fit.

• It is used to construct confidence intervals.

• The magnitude of se depends on the units of measurement of Y and on data magnitude.
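As a small numerical illustration, assuming the simple regression form s_e = √(SSE / (n - 2)); the SSE and n used here are hypothetical.

```python
import math

def standard_error(sse, n):
    """Standard error of the regression for simple regression (df = n - 2)."""
    return math.sqrt(sse / (n - 2))

# SSE = 2.4 with n = 5 is a hypothetical example, not a value from the text
s_e = standard_error(2.4, 5)
print(round(s_e, 4))
```

Because s_e is in the units of Y, it is easiest to judge relative to the typical size of the Y values.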

12.5 Tests for Significance

Confidence Intervals for Slope and Intercept

• Standard errors of the slope and intercept:

sb1 = se / √Σ(xi - x̄)²

sb0 = se √(1/n + x̄² / Σ(xi - x̄)²)


Confidence Intervals for Slope and Intercept

• Confidence intervals for the true slope and intercept (df = n - 2):

b1 - tα/2 sb1 ≤ β1 ≤ b1 + tα/2 sb1

b0 - tα/2 sb0 ≤ β0 ≤ b0 + tα/2 sb0

• Note: One can use Excel, Minitab, MegaStat, or other software to compute these intervals and do hypothesis tests relating to linear regression.
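For readers who want to see the arithmetic behind the software output, here is a sketch assuming the standard simple regression formulas s_b1 = s_e / √SSxx and s_b0 = s_e √(1/n + x̄² / SSxx). The data are hypothetical and the critical t is typed in from a table.

```python
import math

def coefficient_intervals(x, y, t_crit):
    """Return ((lo, hi) for the intercept, (lo, hi) for the slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))                      # standard error of regression
    s_b1 = s_e / math.sqrt(ss_xx)                       # standard error of the slope
    s_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / ss_xx)  # standard error of the intercept
    return ((b0 - t_crit * s_b0, b0 + t_crit * s_b0),
            (b1 - t_crit * s_b1, b1 + t_crit * s_b1))

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Two-tailed critical t for alpha = 0.05 and df = n - 2 = 3, from a t table
intercept_ci, slope_ci = coefficient_intervals(x, y, t_crit=3.182)
print(intercept_ci, slope_ci)
```

With so few observations the slope interval is wide and includes zero, so this hypothetical sample would not show a significant slope at α = 0.05.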


Hypothesis Tests

• Is the true slope different from zero? If β1 = 0, then X cannot influence Y and the regression model collapses to a constant β0 plus random error.

• The hypotheses (for zero slope and/or intercept) to be tested are:

H0: β1 = 0 versus H1: β1 ≠ 0, with test statistic tcalc = b1 / sb1

H0: β0 = 0 versus H1: β0 ≠ 0, with test statistic tcalc = b0 / sb0

• With df = n - 2, reject H0 if |tcalc| > tα/2 or if the p-value ≤ α.
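The zero-slope test can be sketched as follows, assuming the usual statistic t = b1 / s_b1 with df = n - 2. The data are hypothetical and the critical value is taken from a t table.

```python
import math

def slope_t_stat(x, y):
    """t statistic for H0: beta1 = 0 in simple regression (df = n - 2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(ss_xx)
    return b1 / s_b1

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_calc = slope_t_stat(x, y)

t_crit = 3.182  # two-tailed, alpha = 0.05, df = 3, from a t table
reject_h0 = abs(t_calc) > t_crit
print(round(t_calc, 4), reject_h0)
```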

12.6 Analysis of Variance: Overall Fit

Decomposition of Variance

• The decomposition of variance may be written as

SST (total variation around the mean) = SSR (variation explained by the regression) + SSE (unexplained or error variation)

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

F Test for Overall Fit

• To test a regression for overall significance, we use an F test to compare the explained (SSR) and unexplained (SSE) sums of squares: Fcalc = MSR / MSE = (SSR / 1) / (SSE / (n - 2)), with 1 and n - 2 degrees of freedom.
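A minimal sketch of the F calculation for simple regression, assuming F = MSR / MSE with 1 and n - 2 degrees of freedom. The sums of squares below are hypothetical.

```python
def f_statistic(ssr, sse, n):
    """F statistic for overall fit in simple regression (df = 1 and n - 2)."""
    msr = ssr / 1          # mean square for regression (one predictor)
    mse = sse / (n - 2)    # mean square error
    return msr / mse

# Hypothetical sums of squares from a sample of n = 5 observations
f_calc = f_statistic(ssr=3.6, sse=2.4, n=5)
print(round(f_calc, 4))
```

In simple regression the F statistic equals the square of the t statistic for the slope, so the two tests always lead to the same decision.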


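12.7 Confidence and Prediction Intervals for Y

How to Construct an Interval Estimate for Y

• Confidence interval for the conditional mean of Y:

ŷi ± tα/2 se √(1/n + (xi - x̄)² / Σ(xi - x̄)²)

• Prediction interval for an individual value of Y:

ŷi ± tα/2 se √(1 + 1/n + (xi - x̄)² / Σ(xi - x̄)²)

• Prediction intervals are wider than confidence intervals because individual Y values vary more than the mean of Y.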
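The difference between the two intervals is easy to see numerically. This sketch assumes the standard simple regression formulas, in which the prediction interval adds 1 under the square root; the data, the chosen x value, and the tabled t are hypothetical.

```python
import math

def interval_estimates(x, y, x_new, t_crit):
    """Return (confidence interval, prediction interval) for Y at x_new."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_new
    core = 1 / n + (x_new - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s_e * math.sqrt(core)        # for the conditional mean of Y
    pi_half = t_crit * s_e * math.sqrt(1 + core)    # for an individual Y value
    return ((y_hat - ci_half, y_hat + ci_half),
            (y_hat - pi_half, y_hat + pi_half))

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Two-tailed critical t for alpha = 0.05 and df = 3, from a t table
ci, pi = interval_estimates(x, y, x_new=3, t_crit=3.182)
print(ci, pi)
```

Both intervals are centered on the same fitted value and are narrowest at x = x̄, widening as x moves away from the mean.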


12.8 Residual Tests

Three Important Assumptions

1. The errors are normally distributed.

2. The errors have constant variance (i.e., they are homoscedastic).

3. The errors are independent (i.e., they are nonautocorrelated).


Violation of Assumption 1: Non-normal Errors

• Non-normality of errors is a mild violation since the regression parameter estimates b0 and b1 and their variances remain unbiased and consistent.

• Confidence intervals for the parameters may be untrustworthy because the normality assumption is used to justify using Student’s t distribution.

• A large sample size would compensate.

• Outliers could pose serious problems.


Normal Probability Plot

• The normal probability plot tests the assumption

H0: Errors are normally distributed

H1: Errors are not normally distributed

• If H0 is true, the residual probability plot should be linear, as shown in the example.

What to Do About Non-Normality?

1. Trim outliers only if they clearly are mistakes.

2. Increase the sample size if possible.

3. Try a logarithmic transformation of both X and Y.


Violation of Assumption 2: Nonconstant Variance

• The ideal condition is that the error magnitude is constant (i.e., errors are homoscedastic).

• Heteroscedastic (nonconstant) errors increase or decrease with X.

• In the most common form of heteroscedasticity, the variances of the estimators are likely to be understated.

• This results in overstated t statistics and artificially narrow confidence intervals.


Tests for Heteroscedasticity

• Plot the residuals against X. Ideally, there is no pattern in the residuals moving from left to right.

• The “fan-out” pattern of increasing residual variance is the most common pattern indicating heteroscedasticity.


What to Do About Heteroscedasticity?

• Transform both X and Y, for example, by taking logs.

• Although it can widen the confidence intervals for the coefficients, heteroscedasticity does not bias the estimates.


Violation of Assumption 3: Autocorrelated Errors

• Autocorrelation is a pattern of non-independent errors.

• In first-order autocorrelation, the error in period t (e_t) is correlated with the error in the preceding period (e_(t-1)).

• The estimated variances of the OLS estimators are biased, resulting in confidence intervals that are too narrow, overstating the model’s fit.

Runs Test for Autocorrelation

• In the runs test, count the number of sign reversals in the residuals (i.e., how often does the residual cross the zero centerline?).

• If the pattern is random, the number of sign changes should be about n/2.

• Fewer than n/2 would suggest positive autocorrelation.

• More than n/2 would suggest negative autocorrelation.
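The counting step can be sketched as below. The code only counts sign reversals and compares the count with the informal n/2 benchmark; it does not compute a formal test statistic. Dropping residuals that are exactly zero is an assumption made for this illustration, and the residuals are hypothetical.

```python
def count_sign_changes(residuals):
    """Count how many times consecutive residuals differ in sign."""
    # Residuals exactly equal to zero are dropped (an assumption for this sketch)
    signs = [1 if e > 0 else -1 for e in residuals if e != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical residuals listed in time order
residuals = [0.5, 0.8, 0.3, -0.2, -0.6, -0.4, 0.1, 0.7, 0.9, -0.3]
changes = count_sign_changes(residuals)
benchmark = len(residuals) / 2

print(changes, benchmark)
```

Here the residuals stay on the same side of zero for several periods at a time, so the count falls below n/2, the pattern associated with positive autocorrelation.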


Durbin-Watson (DW) Test

• Tests for autocorrelation under the hypotheses

H0: Errors are non-autocorrelated

H1: Errors are autocorrelated

• The DW statistic will range from 0 to 4.

DW < 2 suggests positive autocorrelation

DW = 2 suggests no autocorrelation (ideal)

DW > 2 suggests negative autocorrelation
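A sketch of the calculation, assuming the usual definition DW = Σ(e_t - e_(t-1))² / Σe_t², where the numerator runs over t = 2, ..., n. Both residual series are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual patterns
trending = [1, 1, 1, -1, -1, -1]       # long runs on one side of zero
alternating = [1, -1, 1, -1, 1, -1]    # sign flips every period

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(round(dw_trending, 3), round(dw_alternating, 3))
```

The trending series gives a value well below 2 and the alternating series a value well above 2, matching the interpretation above.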



What to Do About Autocorrelation?

• Transform both variables using the method of first differences, in which both variables are redefined as changes. Then we regress the change in Y against the change in X.

• Although it can widen the confidence interval for the coefficients, autocorrelation does not bias the estimates.
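The first-difference transformation itself is simple. In this sketch the two time series are hypothetical, and the regression of the changes in Y on the changes in X would then be fitted in the usual way.

```python
def first_differences(series):
    """Return the period-to-period changes: series[t] - series[t - 1]."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical time series observed over five periods
x = [10, 12, 15, 19, 24]
y = [5, 6, 8, 9, 13]

dx = first_differences(x)
dy = first_differences(y)
print(dx, dy)
```

Note that differencing loses one observation, so the transformed regression has n - 1 data pairs.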



12.9 Unusual Observations

Standardized Residuals

• One can use Excel, Minitab, MegaStat, or other software to compute standardized residuals.

• If the absolute value of any standardized residual is at least 2, then it is classified as unusual.
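Software packages differ in how they standardize residuals. The sketch below assumes one common definition, e_i / (s_e √(1 - h_i)), where h_i is the leverage of observation i; other packages divide by s_e alone, so results may not match a given program exactly. The data are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Residuals divided by s_e * sqrt(1 - h_i) (one common definition)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_xx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    lev = [1 / n + (xi - x_bar) ** 2 / ss_xx for xi in x]
    return [e / (s_e * math.sqrt(1 - h)) for e, h in zip(resid, lev)]

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
z = standardized_residuals(x, y)

# Flag observations whose standardized residual is at least 2 in absolute value
unusual = [i for i, zi in enumerate(z) if abs(zi) >= 2]
print([round(zi, 3) for zi in z], unusual)
```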


High Leverage

• A high leverage statistic indicates the observation is far from the mean of X.

• These observations are influential because they are at the “end of the lever.”

• The leverage for observation i is denoted hi.

• A leverage that exceeds 3/n is unusual.
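A sketch of the leverage calculation for simple regression, assuming the standard formula h_i = 1/n + (xi - x̄)² / Σ(xi - x̄)². The cutoff uses the 3/n rule stated above, and the X values are hypothetical.

```python
def leverages(x):
    """Leverage h_i of each observation in a simple regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / ss_xx for xi in x]

# Hypothetical X values with one observation far from the rest
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = leverages(x)

cutoff = 3 / len(x)   # rule of thumb stated in the text
high = [xi for xi, hi in zip(x, h) if hi > cutoff]
print([round(hi, 3) for hi in h], high)
```

Leverage depends only on the X values, so an observation can have high leverage whether or not its residual is large.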


12.10 Other Regression Problems (Optional)

Outliers

Outliers may be caused by

- an error in recording data

- impossible data

- an observation that has been influenced by an unspecified “lurking” variable that should have been controlled but wasn’t.

To fix the problem,

- delete the observation(s)

- delete the data

- formulate a multiple regression model that includes the lurking variable.


Model Misspecification

• If a relevant predictor has been omitted, then the model is misspecified.

• Use multiple regression instead of bivariate regression.

Ill-Conditioned Data

• Well-conditioned data values are of the same general order of magnitude.

• Ill-conditioned data have unusually large or small data values and can cause loss of regression accuracy or awkward estimates.


Ill-Conditioned Data

• Avoid mixing magnitudes by adjusting the magnitude of your data before running the regression.

Spurious Correlation

• In a spurious correlation, two variables appear related because of the way they are defined.

• This problem is called the size effect or problem of totals.


Model Form and Variable Transforms

• Sometimes a nonlinear model is a better fit than a linear model.

• Excel offers many model forms.

• Variables may be transformed (e.g., logarithmic or exponential functions) in order to provide a better fit.

• Log transformations can reduce heteroscedasticity.

• Nonlinear models may be difficult to interpret.
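As one illustration of a variable transform, a power relationship y = a·x^b becomes linear after taking logs of both variables: ln y = ln a + b ln x. The sketch below fits that log-log line with OLS; the data are hypothetical and generated exactly from y = 2x^1.5 so the result is easy to check.

```python
import math

def ols_fit(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

# Hypothetical curved data generated from y = 2 * x^1.5 (no noise)
x = [1, 2, 4, 8, 16]
y = [2 * xi ** 1.5 for xi in x]

# Regress ln(y) on ln(x); the slope estimates the exponent b
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
b0, b1 = ols_fit(log_x, log_y)

print(round(b1, 4), round(math.exp(b0), 4))
```

In the log-log form the slope is read as an elasticity (the approximate percent change in Y for a one percent change in X), which is one reason transformed models take more care to interpret.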
