Applied Statistics in Business & Economics, 5 edition · 2017. 4. 27. · Prepared by Lloyd R....
McGraw-Hill/Irwin Copyright © 2015 by The McGraw-Hill Companies, Inc. All rights reserved.
A PowerPoint Presentation Package to Accompany
Applied Statistics in Business &
Economics, 5th edition
David P. Doane and Lori E. Seward
Prepared by Lloyd R. Jaisingh
Chapter 12
Simple Regression
Chapter Contents
12.1 Visual Displays and Correlation Analysis
12.2 Simple Regression
12.3 Regression Models
12.4 Ordinary Least Squares Formulas
12.5 Tests for Significance
12.6 Analysis of Variance: Overall Fit
12.7 Confidence and Prediction Intervals for Y
12.8 Residual Tests
12.9 Unusual Observations
12.10 Other Regression Problems (Optional)
Chapter Learning Objectives (LO’s)
LO12-1: Calculate and test a correlation coefficient for significance.
LO12-2: Interpret a regression equation and use it to make predictions.
LO12-3: Explain the form and assumptions of a simple regression model.
LO12-4: Explain the least squares method, apply formulas for coefficients,
and interpret R².
LO12-5: Construct confidence intervals and test hypotheses for the slope
and intercept.
LO12-6: Interpret the ANOVA table and use it to compute F, R², and
standard error.
LO12-7: Distinguish between confidence and prediction intervals for Y.
LO12-8: Calculate residuals and perform tests of regression
assumptions.
LO12-9: Identify unusual residuals and tell when they are outliers.
LO12-10: Define leverage and identify high-leverage observations.
LO12-11: Improve data conditioning and use transformations if needed
(Optional).
12.1 Visual Displays and Correlation Analysis
Visual Displays
• Begin the analysis of bivariate data (i.e., two variables) with a scatter plot.
• A scatter plot
- displays each observed data pair (xi, yi) as a dot on an X/Y grid.
- indicates visually the strength of the relationship between the two variables.
[Figure: Sample Scatter Plot]
Correlation Coefficient, r
• The sample correlation coefficient (r) measures the degree of linearity in the relationship between X and Y.
• Note: -1 ≤ r ≤ +1
• r = 0 indicates no linear relationship.
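As a minimal sketch of how r can be computed, the standard formula r = SSxy / √(SSxx · SSyy) is coded below. The data are hypothetical and the function name sample_correlation is illustrative; neither comes from the text.

```python
import math

def sample_correlation(x, y):
    """Sample correlation coefficient r for paired data (x_i, y_i)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical illustration data (not from the textbook)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)
print(round(r, 4))  # 0.7746
```

A value near +1 or -1 indicates a nearly linear relationship; a value near 0 indicates little or no linear relationship.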
[Figure: Scatter Plots Showing Various Correlation Values]
Tests for Significant Correlation Using Student’s t
• Note: r is an estimate of the population correlation coefficient ρ (rho).
• Step 1: State the Hypotheses
Determine whether you are using a one- or two-tailed test and the level of significance (α).
H0: ρ = 0
H1: ρ ≠ 0
• Step 2: Specify the Decision Rule
For degrees of freedom df = n - 2, look up the critical value tα (tα/2 for a two-tailed test) in Appendix D.
• Step 3: Calculate the Test Statistic
tcalc = r √(n - 2) / √(1 - r²)
• Step 4: Make the Decision
Reject H0 if tcalc > tα/2 or if tcalc < -tα/2.
• Also, reject H0 if the p-value ≤ α.
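The four steps can be sketched in a few lines. This is an illustration under stated assumptions: the sample values r = 0.7746 and n = 5 are hypothetical, the function name is illustrative, and the critical value 3.182 (two-tailed, α = 0.05, df = 3) is taken from a standard t table rather than computed.

```python
import math

def correlation_t_stat(r, n):
    """t statistic for H0: rho = 0, with df = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.7746, 5        # hypothetical sample correlation and sample size
t_crit = 3.182          # t table value: alpha = 0.05 (two-tailed), df = n - 2 = 3

t_calc = correlation_t_stat(r, n)
reject_h0 = abs(t_calc) > t_crit   # two-tailed decision rule

print(round(t_calc, 3), reject_h0)  # 2.121 False
```

With so small a sample, even a fairly large r does not differ significantly from zero.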
Critical Value for Correlation Coefficient (Tests for Significance)
• Equivalently, you can calculate the critical value for the correlation coefficient using
rcrit = t / √(t² + n - 2), where t = tα/2 is the two-tailed critical value for df = n - 2.
• This method gives a benchmark for the correlation coefficient.
• However, it yields no p-value and is inflexible if you change your mind about α.
• MegaStat uses this method, giving two-tail critical values for α = 0.05 and α = 0.01.
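A short sketch of the benchmark calculation, rcrit = t / √(t² + n - 2), follows. The sample size n = 5 is hypothetical and the t value 3.182 (α = 0.05 two-tailed, df = 3) is read from a t table.

```python
import math

def critical_r(t_crit, n):
    """Critical correlation coefficient for a given critical t and sample size n."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

n = 5                    # hypothetical sample size
t_crit = 3.182           # t table value: alpha = 0.05 (two-tailed), df = 3
r_crit = critical_r(t_crit, n)

print(round(r_crit, 3))  # 0.878
```

A sample correlation whose absolute value exceeds r_crit would be judged significant at the chosen α, which is the same decision the t test gives.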
12.2 Simple Regression
What is Simple Regression?
• Simple regression analyzes the relationship between two variables.
• It specifies one dependent (response) variable and one independent (predictor) variable.
• The hypothesized relationship here is linear, of the form Y = slope × X + y-intercept.
[Slides: Interpreting an Estimated Regression Equation: Examples; Prediction Using Regression: Examples]
12.3 Regression Models
Model and Parameters
• The assumed model for a linear relationship is y = β0 + β1x + ε.
• The relationship holds for all pairs (xi, yi).
• The error term ε is not observable and is assumed to be independently normally distributed with a mean of 0 and standard deviation σ.
• The unknown parameters are:
β0 Intercept
β1 Slope
• The fitted model or regression model is used to predict the expected value of Y for a given value of X: ŷ = b0 + b1x.
• The fitted coefficients are
b0 the estimated intercept
b1 the estimated slope
Fitting a Regression on a Scatter Plot
A more precise method is to let Excel calculate the estimates. We enter observations on the independent variable x1, x2, . . ., xn and the dependent variable y1, y2, . . ., yn into separate columns, and let Excel fit the regression equation, as illustrated in Figure 12.6. Excel will choose the regression coefficients so as to produce a good fit.
Slope and Intercept Interpretations
• Figure 12.6 (previous slide) shows a sample of miles per gallon and
horsepower for 15 engines. The Excel graph and its fitted regression
equation are also shown.
• Slope Interpretation: The slope of -0.0785 says that for each
additional unit of engine horsepower, the miles per gallon decreases
by 0.0785 mile. This estimated slope is a statistic because a different
sample might yield a different estimate of the slope.
• Intercept Interpretation: The intercept value of 49.216 suggests
that when the engine has no horsepower, the fuel efficiency would
be quite high. However, the intercept has little meaning in this case,
not only because zero horsepower makes no logical sense, but also
because extrapolating to x = 0 is beyond the range of the observed
data.
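To make the slope interpretation concrete, the fitted equation quoted above (intercept 49.216, slope -0.0785) can be wrapped in a small function. This is only a sketch: the function name predict_mpg and the 200-horsepower input are hypothetical, and the input is assumed to lie within the observed range of the data.

```python
def predict_mpg(horsepower, intercept=49.216, slope=-0.0785):
    """Predicted miles per gallon from the fitted line: intercept + slope * horsepower."""
    return intercept + slope * horsepower

# Hypothetical engine, assumed to be inside the range of the sample
print(round(predict_mpg(200), 3))  # 33.516

# One extra horsepower changes the prediction by exactly the slope
print(round(predict_mpg(101) - predict_mpg(100), 4))  # -0.0785
```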
12.4 Ordinary Least Squares (OLS) Formulas
Slope and Intercept
• The ordinary least squares (OLS) method estimates the slope and intercept of the regression line so that the sum of squared residuals is minimized, which gives the best fit in the least squares sense.
• The sum of the residuals = 0.
• The sum of the squared residuals is SSE.
• The OLS estimator for the slope is:
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
• The OLS estimator for the intercept is:
b0 = ȳ - b1x̄
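A minimal sketch of the OLS calculation, using the standard estimators b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1x̄, is shown below. The data and the function name ols_fit are hypothetical illustrations, not taken from the text.

```python
def ols_fit(x, y):
    """Return (b0, b1), the OLS intercept and slope for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical illustration data (not from the textbook)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = ols_fit(x, y)

# Residuals are observed minus fitted values; under OLS they sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(round(b0, 4), round(b1, 4))  # 2.2 0.6
```

Up to floating-point rounding, sum(residuals) is zero, which matches the property that OLS residuals always sum to zero.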
*Recall from Chapter 8 that an unbiased estimator’s expected value is the true
parameter and that a consistent estimator approaches ever closer to the true
parameter as the sample size increases.
Assessing Fit
• We want to explain the total variation in Y around its mean (SST, the total sum of squares).
• The regression sum of squares (SSR) is the explained variation in Y.
• The error sum of squares (SSE) is the unexplained variation in Y.
• If the fit is good, SSE will be relatively small compared to SST.
• A perfect fit is indicated by SSE = 0.
• The magnitude of SSE depends on n and on the units of measurement.
Coefficient of Determination
• R² is a measure of relative fit based on a comparison of SSR and SST: R² = SSR/SST = 1 - SSE/SST.
• Often expressed as a percent, R² = 1 (i.e., 100%) indicates a perfect fit. In simple regression, R² = r².
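The three sums of squares and R² can be computed directly from their definitions: SST = Σ(yi - ȳ)², SSR = Σ(ŷi - ȳ)², and SSE = Σ(yi - ŷi)². The sketch below uses hypothetical data whose OLS estimates are b0 = 2.2 and b1 = 0.6; the variable names are illustrative.

```python
# Hypothetical illustration data (not from the textbook)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # OLS estimates for these data

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]                      # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
r_squared = ssr / sst

print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
# 6.0 3.6 2.4 0.6
```

Here 60 percent of the variation in Y is explained by X, and SSR + SSE reproduces SST.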
Standard Error of Regression
• The standard error (se) is an overall measure of model fit: se = √(SSE / (n - 2)).
• If the fitted model’s predictions are perfect (SSE = 0), then se = 0. Thus, a small se indicates a better fit.
• Used to construct confidence intervals.
• The magnitude of se depends on the units of measurement of Y and on data magnitude.
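As a brief sketch, the standard error of the regression follows from SSE and the sample size via se = √(SSE / (n - 2)). The inputs SSE = 2.4 and n = 5 are hypothetical, and the function name is illustrative.

```python
import math

def standard_error(sse, n):
    """Standard error of a simple regression: sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))

se = standard_error(2.4, 5)   # hypothetical SSE and sample size
print(round(se, 4))           # 0.8944
```

Because se is in the units of Y, it is best compared across models fitted to the same dependent variable.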
12.5 Tests for Significance
Confidence Intervals for Slope and Intercept
• Standard error of the slope and intercept:
sb1 = se / √Σ(xi - x̄)²
sb0 = se √(1/n + x̄² / Σ(xi - x̄)²)
• Confidence interval for the true slope and intercept:
b1 - tα/2 sb1 ≤ β1 ≤ b1 + tα/2 sb1
b0 - tα/2 sb0 ≤ β0 ≤ b0 + tα/2 sb0
where sb1 and sb0 are the standard errors of the slope and intercept, and df = n - 2.
• Note: One can use Excel, Minitab, MegaStat, or other software to compute these intervals and do hypothesis tests relating to linear regression.
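The slope interval b1 ± tα/2 · sb1, with sb1 = se / √Σ(xi - x̄)², can be sketched as follows. All inputs are hypothetical (b1 = 0.6, se = 0.8944, five x values), and the critical value 3.182 for α = 0.05 and df = 3 is read from a t table rather than computed.

```python
import math

# Hypothetical inputs (not from the textbook)
x = [1, 2, 3, 4, 5]
b1 = 0.6          # estimated slope
se = 0.8944       # standard error of the regression
t_crit = 3.182    # t table value: alpha = 0.05 (two-tailed), df = n - 2 = 3

x_bar = sum(x) / len(x)
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

s_b1 = se / math.sqrt(ss_xx)      # standard error of the slope
lower = b1 - t_crit * s_b1        # lower 95% confidence limit
upper = b1 + t_crit * s_b1        # upper 95% confidence limit
t_calc = b1 / s_b1                # test statistic for H0: beta1 = 0

print(round(s_b1, 4), round(lower, 2), round(upper, 2), round(t_calc, 3))
```

The interval includes zero exactly when |tcalc| does not exceed the critical value, so the confidence interval and the two-tailed t test lead to the same conclusion.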
Hypothesis Tests
• Is the true slope different from zero? If β1 = 0, then X cannot influence Y and the regression model collapses to a constant β0 plus random error.
• The hypotheses (for zero slope and/or intercept) to be tested are:
H0: β1 = 0 vs. H1: β1 ≠ 0 (slope)
H0: β0 = 0 vs. H1: β0 ≠ 0 (intercept)
• Test statistic: tcalc = (estimated coefficient - 0) / (its standard error), with df = n - 2.
• Reject H0 if |tcalc| > tα/2 or if the p-value ≤ α.
12.6 Analysis of Variance: Overall Fit
Decomposition of Variance
• The decomposition of variance may be written as
SST = SSR + SSE
(total variation around the mean = variation explained by the regression + unexplained or error variation)
F Test for Overall Fit
• To test a regression for overall significance, we use an F test to compare the explained (SSR) and unexplained (SSE) sums of squares.
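For a simple regression the F statistic is the ratio of the mean squares, F = MSR / MSE, with MSR = SSR / 1 and MSE = SSE / (n - 2). The sketch below uses hypothetical sums of squares (SSR = 3.6, SSE = 2.4, n = 5); the critical value 10.13 for α = 0.05 with (1, 3) degrees of freedom is read from an F table.

```python
def anova_f(ssr, sse, n):
    """F statistic for overall fit in simple regression (one predictor)."""
    msr = ssr / 1          # regression mean square, df = 1
    mse = sse / (n - 2)    # error mean square, df = n - 2
    return msr / mse

f_calc = anova_f(3.6, 2.4, 5)   # hypothetical SSR, SSE, and n
f_crit = 10.13                  # F table value: alpha = 0.05, df = (1, 3)
significant = f_calc > f_crit

print(round(f_calc, 3), significant)  # 4.5 False
```

In simple regression the F statistic equals the square of the slope's t statistic, so the F test and the two-tailed t test for the slope always agree.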
12.7 Confidence and Prediction Intervals for Y
How to Construct an Interval Estimate for Y
• Confidence interval for the conditional mean of Y.
• Prediction intervals are wider than confidence intervals because individual Y values vary more than the mean of Y.
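The difference between the two intervals can be seen in their standard formulas: the confidence interval for the mean of Y at x0 uses ŷ ± tα/2 · se · √(1/n + (x0 - x̄)²/Σ(xi - x̄)²), while the prediction interval adds 1 under the square root. The sketch below assumes hypothetical data and estimates (b0 = 2.2, b1 = 0.6, se = 0.8944) and a t table value of 3.182 for df = 3.

```python
import math

def interval_half_widths(x0, x, se, t_crit):
    """Half-widths of the confidence and prediction intervals for Y at x0."""
    n = len(x)
    x_bar = sum(x) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    core = 1 / n + (x0 - x_bar) ** 2 / ss_xx
    ci_half = t_crit * se * math.sqrt(core)        # mean of Y at x0
    pi_half = t_crit * se * math.sqrt(1 + core)    # individual Y at x0
    return ci_half, pi_half

# Hypothetical inputs (not from the textbook)
x = [1, 2, 3, 4, 5]
b0, b1, se, t_crit = 2.2, 0.6, 0.8944, 3.182
x0 = 4

y_hat = b0 + b1 * x0
ci_half, pi_half = interval_half_widths(x0, x, se, t_crit)

print(round(y_hat, 2), round(ci_half, 3), round(pi_half, 3))
```

Both intervals are centered on the same ŷ, and both widen as x0 moves away from x̄; the prediction interval is always the wider of the two.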
12.8 Residual Tests
Three Important Assumptions
1. The errors are normally distributed.
2. The errors have constant variance (i.e., they are homoscedastic).
3. The errors are independent (i.e., they are nonautocorrelated).
Violation of Assumption 1: Non-normal Errors
• Non-normality of errors is a mild violation since the regression
parameter estimates b0 and b1 and their variances remain
unbiased and consistent.
• Confidence intervals for the parameters may be untrustworthy because the normality assumption is used to justify using Student’s t distribution.
Non-normal Errors
• A large sample size would compensate.
• Outliers could pose serious problems.
Normal Probability Plot
• The Normal Probability Plot tests the assumption
H0: Errors are normally distributed
H1: Errors are not normally distributed
• If H0 is true, the residual probability plot should be linear, as shown in the example.
What to Do About Non-Normality?
1. Trim outliers only if they clearly are mistakes.
2. Increase the sample size if possible.
3. Try a logarithmic transformation of both X and Y.
Violation of Assumption 2: Nonconstant Variance
• The ideal condition is a constant error magnitude (i.e., the errors are homoscedastic).
• Heteroscedastic (nonconstant) errors increase or decrease with X.
• In the most common form of heteroscedasticity, the variances of the
estimators are likely to be understated.
• This results in overstated t statistics and artificially narrow
confidence intervals.
Tests for Heteroscedasticity
• Plot the residuals against X.
Ideally, there is no pattern in the
residuals moving from left to right.
• The “fan-out” pattern of increasing residual variance is the most
common pattern indicating heteroscedasticity.
What to Do About Heteroscedasticity?
• Transform both X and Y, for example, by taking logs.
• Although it can widen the confidence intervals for the coefficients,
heteroscedasticity does not bias the estimates.
Violation of Assumption 3: Autocorrelated Errors
• Autocorrelation is a pattern of non-independent errors.
• In first-order autocorrelation, e_t is correlated with e_(t-1).
• The estimated variances of the OLS estimators are biased,
resulting in confidence intervals that are too narrow, overstating the
model’s fit.
Runs Test for Autocorrelation
• In the runs test, count the number of the residual’s sign reversals (i.e., how
often does the residual cross the zero centerline?).
• If the pattern is random, the number of sign changes should be about n/2.
• Fewer than n/2 would suggest positive autocorrelation.
• More than n/2 would suggest negative autocorrelation.
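Counting the sign reversals is easy to automate. The sketch below is only the informal count described above, not a formal runs test with critical values; the residuals are hypothetical, and skipping residuals that are exactly zero is an assumption made here for simplicity.

```python
def count_sign_changes(residuals):
    """Number of times consecutive residuals switch sign (zeros are skipped)."""
    signs = [e > 0 for e in residuals if e != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical residuals in time order (not from the textbook)
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
crossings = count_sign_changes(residuals)

print(crossings, len(residuals) / 2)  # 2 2.5
```

Here the count of 2 is close to n/2 = 2.5, so this informal check gives no strong hint of autocorrelation.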
Durbin-Watson (DW) Test
• Tests for autocorrelation under the hypotheses
H0: Errors are non-autocorrelated
H1: Errors are autocorrelated
• The DW statistic will range from 0 to 4.
DW < 2 suggests positive autocorrelation
DW = 2 suggests no autocorrelation (ideal)
DW > 2 suggests negative autocorrelation
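The Durbin-Watson statistic is computed from the residuals as DW = Σ(e_t - e_(t-1))² / Σ e_t², with the numerator summed from the second residual onward. A minimal sketch with hypothetical residual series:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual series (not from the textbook)
dw_mixed = durbin_watson([-0.8, 0.6, 1.0, -0.6, -0.2])   # near 2
dw_positive = durbin_watson([1, 1, 1, -1, -1, -1])       # long runs, below 2
dw_negative = durbin_watson([1, -1, 1, -1])              # alternating, above 2

print(round(dw_mixed, 3), round(dw_positive, 3), round(dw_negative, 3))
# 2.017 0.667 3.0
```

A formal test compares DW with tabulated lower and upper bounds; those tables are not reproduced here.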
What to Do About Autocorrelation?
• Transform both variables using the method of first differences, in which both variables are redefined as changes: Δx_t = x_t - x_(t-1) and Δy_t = y_t - y_(t-1). Then we regress ΔY against ΔX.
• Although it can widen the confidence interval for the coefficients,
autocorrelation does not bias the estimates.
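First differencing simply replaces each series by its period-to-period changes, which costs one observation. A small sketch with hypothetical time series and an illustrative function name:

```python
def first_differences(series):
    """Period-to-period changes: series[t] - series[t-1] for t = 1, ..., n-1."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

# Hypothetical time series (not from the textbook)
x = [5, 6, 8, 11]
y = [10, 12, 15, 19]

dx = first_differences(x)
dy = first_differences(y)

print(dx, dy)  # [1, 2, 3] [2, 3, 4]
```

The regression is then fitted to the differenced pairs (dx, dy) instead of the original levels.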
12.9 Unusual Observations
Standardized Residuals
• One can use Excel, Minitab, MegaStat or other software to compute
standardized residuals.
• If the absolute value of any standardized residual is at least 2, then it is
classified as unusual.
High Leverage
• A high leverage statistic indicates the observation is far from the
mean of X.
• These observations are influential because they are at the “end of the lever.”
• The leverage for observation i is denoted hi.
• A leverage statistic that exceeds 4/n is unusual.
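In simple regression the leverage of observation i is hi = 1/n + (xi - x̄)² / Σ(xj - x̄)². The sketch below flags high-leverage points using the common 4/n rule of thumb for simple regression; the cutoff is passed as a parameter so that a different threshold can be substituted. The data and function names are hypothetical.

```python
def leverages(x):
    """Leverage h_i of each observation in a simple regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / ss_xx for xi in x]

def high_leverage_indices(x, multiplier=4):
    """Indices whose leverage exceeds multiplier / n (4/n is an assumed rule of thumb)."""
    cutoff = multiplier / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]

# Hypothetical x values; the last one lies far from the mean
x = [1, 2, 3, 4, 20]
h = leverages(x)
flagged = high_leverage_indices(x)

print([round(v, 3) for v in h], flagged)
# [0.3, 0.264, 0.236, 0.216, 0.984] [4]
```

Only the last observation is flagged, since it lies far from the mean of X.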
12.10 Other Regression Problems (Optional)
Outliers
Outliers may be caused by
- an error in recording data
- impossible data
- an observation that has been influenced by an unspecified “lurking” variable that should have been controlled but wasn’t.
To fix the problem,
- delete the observation(s)
- delete the data
- formulate a multiple regression model that includes the lurking variable.
Model Misspecification
• If a relevant predictor has been omitted, then the model is
misspecified.
• Use multiple regression instead of bivariate regression.
Ill-Conditioned Data
• Well-conditioned data values are of the same general order of
magnitude.
• Ill-conditioned data have unusually large or small data values and
can cause loss of regression accuracy or awkward estimates.
• Avoid mixing magnitudes by adjusting the magnitude of your data
before running the regression.
Spurious Correlation
• In a spurious correlation two variables appear related because of
the way they are defined.
• This problem is called the size effect or problem of totals.
Model Form and Variable Transforms
• Sometimes a nonlinear model is a better fit than a linear model.
• Excel offers many model forms.
• Variables may be transformed (e.g., logarithmic or exponential
functions) in order to provide a better fit.
• Log transformations reduce heteroscedasticity.
• Nonlinear models may be difficult to interpret.
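One common transformation can be sketched briefly: if Y grows roughly exponentially in X, then ln(Y) is roughly linear in X and can be fitted by ordinary least squares. The data below are hypothetical and generated exactly from y = 2·e^(0.5x), so the fit recovers those values; real data would only approximate them. The helper ols_fit is an illustrative name.

```python
import math

def ols_fit(x, y):
    """Return (b0, b1), the OLS intercept and slope for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical data following y = 2 * exp(0.5 * x) exactly
x = [1, 2, 3, 4, 5]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Regress ln(y) on x: ln(y) = ln(2) + 0.5 * x
log_y = [math.log(yi) for yi in y]
b0, b1 = ols_fit(x, log_y)
a_hat = math.exp(b0)   # back-transform the intercept to the original scale

print(round(a_hat, 4), round(b1, 4))  # 2.0 0.5
```

The slope of the log model is read as an approximate proportional growth rate per unit of X, which is one reason transformed models can be harder to interpret.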