Chapter 11. Linear Regression and Correlation


Prediction vs. Explanation

• Prediction – reference to future values

• Explanation – reference to current or past values

• For prediction (or explanation) to make much sense, there must be some connection between the variable we’re predicting (the dependent variable) and the variable we’re using to make the prediction (the independent variable).

Simple Regression (1)

• There is a single independent variable and the equation for predicting a dependent variable y is a linear function of a given independent variable x.

• For example, the prediction equation

$$\hat{y} = 2.0 + 3.0x$$

is a linear equation. The constant term, 2.0, is the intercept and is interpreted as the predicted value of y when x = 0. The coefficient of x, 3.0, is the slope of the line: the predicted change in y when there is a one-unit change in x.

Simple Regression (2)

• The formal assumptions of regression analysis include:

– Linearity: the relation is, in fact, linear, so that the errors all have expected value zero, E(εi) = 0 for all i.

– Equal variance: the errors all have the same variance, Var(εi) = σε² for all i.

– Independence: the errors are independent of each other.

– Normality: the errors are all normally distributed; εi is normally distributed for all i.

Scatterplot

• To check the assumptions of regression analysis, it is important to look at a scatterplot of the data. This is simply a plot of each (x, y) point, with the independent variable value on the horizontal axis, and the dependent variable value measured on the vertical axis.
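As a minimal illustration (assuming numpy and matplotlib are available; the data here are made up for the sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: x = independent variable, y = dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([6.1, 7.8, 11.2, 14.1, 16.8, 20.3, 22.9, 26.4])

plt.scatter(x, y)                       # independent variable on the horizontal axis
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.title("Scatterplot of y versus x")
plt.show()
```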

Smoothers (1)

• Smoothers have been developed to sketch a curve through data without necessarily assuming any particular model.

• If such a smoother yields something close to a straight line, then linear regression is reasonable.

• One such method is called LOWESS (locally weighted scatterplot smoother). Roughly, a smoother takes a relatively narrow “slice” of data along the x axis, calculates a line that fits the data in that slice, moves the slice slightly along the x axis, recalculates the line, and so on. Then all the little lines are connected in a smooth curve. The width of the slice is called the bandwidth; this may often be controlled in the computer program that does the smoothing.
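A sketch of how a LOWESS curve might be overlaid on a scatterplot, assuming the statsmodels package is available; its `frac` argument plays the role of the bandwidth described above, and the data are simulated:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Simulated noisy data with an underlying linear relation
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=4.0, size=x.size)

# frac is the fraction of the data used in each local fit (the "bandwidth")
smoothed = lowess(y, x, frac=0.3)      # returns (x, smoothed y) pairs, sorted by x

plt.scatter(x, y, alpha=0.5)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red", label="LOWESS")
plt.legend()
plt.show()
```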

Smoothers (2)

Smoothers (3)

• Another type of scatterplot smoother is the spline fit.

– It can be understood as taking a narrow slice of data, fitting a curve (often a cubic equation) to the slice, moving to the next slice, fitting another curve, and so on.

– The curves are calculated in such a way as to form a connected, continuous curve.

• If the scatterplot does not appear linear, by itself or when fitted with a LOWESS curve, it can often be “straightened out” by a transformation of either the independent variable or the dependent variable.

• Several transformations of the independent variables can be tried to find a more linear scatterplot.

– Three common transformations are square root, natural logarithm, and inverse (one divided by the variable).

– Finding a good transformation often requires trial and error.
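A hedged sketch of this trial-and-error process on simulated data whose true relation is roughly y ≈ 2 + 30/x; the transformation whose correlation with y is closest to ±1 gives the most nearly linear scatterplot:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=50)
y = 2.0 + 30.0 / x + rng.normal(scale=1.0, size=x.size)   # curved relation in x

candidates = {
    "untransformed": x,
    "square root":   np.sqrt(x),
    "natural log":   np.log(x),
    "inverse":       1.0 / x,
}
for name, xt in candidates.items():
    r = np.corrcoef(xt, y)[0, 1]
    print(f"{name:14s} correlation with y: {r:+.3f}")     # inverse should be closest to +1
```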

Smoothers (4)

Case Study: Comparison of Two Methods for Detecting E. coli

• The researcher wanted to evaluate the agreement of the results obtained using the HEC test with results obtained from an elaborate laboratory-based procedure, hydrophobic grid membrane filtration (HGMF). The HEC test is easier to inoculate, more compact to incubate, and safer to handle than conventional procedures.

Linear Regression --- Estimating Model Parameters

• The intercept β0 and slope β1 in the regression model y = β0 + β1x + ε are population quantities.

• We must estimate these values from sample data. The error variance σε² is another population parameter that must be estimated.

• The first regression problem is to obtain estimates of the slope, intercept, and variance.

• The first step in examining the relation between y and x is to plot the data as a scatterplot.

Least-squares Method (1)

• The regression analysis problem is to find the best straight-line prediction.

• The most common criterion for “best” is based on squared prediction error.

• We find the equation of the prediction line---that is, the slope and intercept that minimize the total squared prediction error.

• The method that accomplishes this goal is called the least-squares method, because it chooses β̂0 and β̂1 to minimize the quantity

$$\sum_i (y_i - \hat{y}_i)^2 = \sum_i \left[\, y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \,\right]^2$$

Least-squares Method (2)

Least-squares Method (3)

• The least-squares estimates of slope and intercept are obtained as follows:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where

$$S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}), \qquad S_{xx} = \sum_i (x_i - \bar{x})^2$$
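A minimal numpy sketch of these formulas, cross-checked against scipy.stats.linregress (the data are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([6.1, 7.8, 11.2, 14.1, 16.8, 20.3, 22.9, 26.4])

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

beta1_hat = Sxy / Sxx                          # least-squares slope
beta0_hat = y.mean() - beta1_hat * x.mean()    # least-squares intercept

# Cross-check with a packaged routine
res = stats.linregress(x, y)
print(beta0_hat, beta1_hat)
print(res.intercept, res.slope)                # should agree
```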

Least-squares Method (4)

• The estimate of the regression slope can potentially be greatly affected by high leverage points. These are points that have very high or very low values of the independent variable --- outliers in the x direction. They carry great weight in the estimate of the slope.

• A high leverage point that also happens to correspond to a y outlier is a high influence point. It will alter the slope and twist the line badly.

• A point has high influence if omitting it from the data will cause the regression line to change substantially.

• A high leverage point indicates only a potential distortion of the equation. Whether or not including the point will “twist” the equation depends on its influence (whether or not the point falls near the line through the remaining points). A point must have both high leverage and outlying y value to qualify as a high influence point.

Least-squares Method (5)

Least-squares Method (6)

• Most computer programs that perform regression analyses will calculate one or another of several diagnostic measures of leverage and influence.

• Very large values of any of these measures correspond to very high leverage or influence points.

• The distinction between high leverage (x outlier) and high influence (x outlier and y outlier) points is not yet universally agreed upon.

Least-squares Method (7)

• The standard error of the slope is calculated by all statistical packages. Typically, it is shown in output in a column to the right of the coefficient column.

• Like any standard error, it indicates how accurately one can estimate the correct population or process value.

• The quality of estimation of β̂1 is influenced by two quantities: the error variance σε² and the amount of variation in the independent variable, Sxx:

$$\sigma_{\hat{\beta}_1} = \frac{\sigma_\varepsilon}{\sqrt{S_{xx}}}$$

Least-squares Method (8)

• The greater the variability of the y values for a given value of x, the larger σβ̂1 is. Sensibly, if there is high variability around the regression line, it is difficult to estimate that line.

• Also, the smaller the variation in x values (as measured by Sxx), the larger σβ̂1 is. The slope is the predicted change in y per unit change in x; if x changes very little in the data, so that Sxx is small, it is difficult to estimate the rate of change in y accurately.

• The standard error of the estimated intercept β̂0 is influenced by n, naturally, and also by the size of the square of the sample mean, x̄², relative to Sxx:

$$\sigma_{\hat{\beta}_0} = \sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

Least-squares Method (9)

• The estimate of σε² based on the sample data is the sum of squared residuals divided by n − 2, the degrees of freedom. The estimated variance is often shown in computer output as MS(Error) or MS(Residual).

• Recall that MS stands for “mean square” and is always a sum of squares divided by the appropriate degrees of freedom:

$$S_\varepsilon^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2} = \frac{SS(\text{Residual})}{n - 2}$$

Least-squares Method (10)

• The square root Sε of the sample variance is called the sample standard deviation around the regression line, the standard error of estimate, or the residual standard deviation.

• Because Sε estimates σε, the standard deviation of the yi, it estimates the standard deviation of the population of y values associated with a given value of the independent variable x.
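Continuing the same hypothetical example, the residual standard deviation and the standard errors of the slope and intercept can be computed directly from the formulas above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([6.1, 7.8, 11.2, 14.1, 16.8, 20.3, 22.9, 26.4])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
ss_residual = np.sum(resid ** 2)
s_eps = np.sqrt(ss_residual / (n - 2))                    # residual standard deviation = sqrt(MS(Residual))

se_b1 = s_eps / np.sqrt(Sxx)                              # standard error of the slope
se_b0 = s_eps * np.sqrt(1.0 / n + x.mean() ** 2 / Sxx)    # standard error of the intercept
print(s_eps, se_b1, se_b0)
```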

Example 11.4

Answers to Example 11.4 (1)

Answers to Example 11.4 (2)

Answers to Example 11.4 (3)

Inferences about Regression Parameters (1)

• The t distribution can be used to make significance tests and confidence intervals for the true slope and intercept.

• One natural null hypothesis is that the true slope β1 equals 0. If this H0 is true, a change in x yields no predicted change in y, and it follows that x has no value in predicting y.

• The sample slope β̂1 has expected value β1 and standard error

$$\sigma_{\hat{\beta}_1} = \frac{\sigma_\varepsilon}{\sqrt{S_{xx}}}$$

Inferences about Regression Parameters (2)

• In practice, σε is not known and must be estimated by Sε, the residual standard deviation. In almost all regression analysis computer outputs, the estimated standard error is shown next to the coefficient. A test of this null hypothesis is given by the t statistic

$$t = \frac{\hat{\beta}_1 - \beta_1}{\text{estimated standard error}(\hat{\beta}_1)} = \frac{\hat{\beta}_1 - \beta_1}{S_\varepsilon / \sqrt{S_{xx}}}$$

• The most common use of this statistic is shown in the following summary:

Inferences about Regression Parameters (3)

• It is also possible to calculate a confidence interval for the true slope. This is an excellent way to communicate the likely degree of inaccuracy in the estimate of that slope. The confidence interval once again is simply the estimate plus or minus a t table value times the standard error.

• The required degrees of freedom for the table value tα/2 is n-2, the error df.

$$\hat{\beta}_1 - t_{\alpha/2}\, \frac{S_\varepsilon}{\sqrt{S_{xx}}} \;\le\; \beta_1 \;\le\; \hat{\beta}_1 + t_{\alpha/2}\, \frac{S_\varepsilon}{\sqrt{S_{xx}}}$$
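A sketch of the slope t test and 95% confidence interval for the same hypothetical data; the standard error and two-sided p-value should match what a packaged routine such as scipy.stats.linregress reports:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([6.1, 7.8, 11.2, 14.1, 16.8, 20.3, 22.9, 26.4])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s_eps / np.sqrt(Sxx)

# t statistic for H0: beta1 = 0, with n - 2 df
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval for the true slope
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(t_stat, p_value, ci)
```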

Inferences about Regression Parameters (4)

• There is an alternative test, an F test, for the null hypothesis of no predictive value. It was designed to test the null hypothesis that all predictors have no value in predicting y. This test gives the same result as a two-sided t test of H0: β1 = 0 in simple linear regression; to say that all predictors have no value is to say that the (only) slope is 0. The F test is summarized below.

Inferences about Regression Parameters (5)

• The comparable hypothesis testing and confidence interval formulas for the intercept β0 use the estimated standard error of β̂0:

$$\hat{\sigma}_{\hat{\beta}_0} = S_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

• In practice, this parameter is of less interest than the slope. In particular, there is often no reason to hypothesize that the true intercept is zero (or any other particular value). Computer packages almost always test the null hypothesis of zero slope, but some don’t bother with a test on the intercept term.

Predicting New y Values Using Regression (1)

• There are two possible interpretations of a y prediction based on a given x. Suppose that a highway director substitutes x = 6 miles into the regression equation ŷ = 2.0 + 3.0x and gets ŷ = 20. This can be interpreted as either:

– The average cost E(y) of all resurfacing contracts for 6 miles of road will be $20,000; or

– The cost y of this specific resurfacing contract for 6 miles of road will be $20,000.

• The best-guess prediction in either case is 20, but the plus or minus factor differs. It is easier to predict an average value E(y) than an individual y value, so the plus or minus factor should be smaller for predicting an average.

Predicting New y Values Using Regression (2)

• In the mean-value forecasting problem, the standard error of ŷn+1 can be shown to be

$$\sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}$$

• Here Sxx is the sum of squared deviations of the original n values of xi; it can be calculated from most computer outputs as

$$S_{xx} = \left(\frac{S_\varepsilon}{\text{standard error}(\hat{\beta}_1)}\right)^2$$

Predicting New y Values Using Regression (3)

• The forecasting plus or minus term in the confidence interval for E(yn+1) depends on the sample size n and the standard deviation around the regression line, as one might expect.

• It also depends on the squared distance of xn+1 from x̄ (the mean of the previous xi values) relative to Sxx.

• As xn+1 gets farther from x̄, the term (xn+1 − x̄)²/Sxx gets larger.

• When xn+1 is far away from the other x values, so that this term is large, the prediction is a considerable extrapolation from the data.

• Small errors in estimating the regression line are magnified by the extrapolation. The term (xn+1 − x̄)²/Sxx could be called an extrapolation penalty because it increases with the degree of extrapolation.

Predicting New y Values Using Regression (4)

• Extrapolation---predicting the results at independent variable values far from the data---is often tempting and always dangerous. Using it requires an assumption that the relation will continue to be linear, far beyond the data.

• The extrapolation penalty term actually understates the risk of extrapolation. It is based on the assumption of a linear relation, and that assumption gets very shaky for large extrapolations.

• The confidence and prediction intervals also depend heavily on the assumption of constant variance.

Predicting New y Values Using Regression (5)

• Usually, the more relevant forecasting problem is that of predicting an individual yn+1 value rather than E(yn+1). In most computer packages, the interval for predicting an individual value is called a prediction interval. The same best guess ŷn+1 is used, but the forecasting plus or minus term is larger when predicting yn+1 than E(yn+1).

• In fact, the plus or minus forecasting error using ŷn+1 to predict yn+1 can be shown to be as follows.

• Prediction interval for yn+1:

$$\hat{y}_{n+1} - t_{\alpha/2}\, S_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}} \;\le\; y_{n+1} \;\le\; \hat{y}_{n+1} + t_{\alpha/2}\, S_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}$$

where tα/2 cuts off area α/2 in the right tail of the t distribution with n − 2 df.

Predicting New y Values Using Regression (6)

• The only difference between prediction of a mean E(yn+1) and prediction of an individual yn+1 is the term +1 in the standard error formula. The presence of this extra term indicates that predictions of individual values are less accurate than predictions of means.

• If n is large and the extrapolation term is small, the +1 term dominates the square root factor in the prediction interval.
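The following sketch puts the two intervals side by side for the same hypothetical data, evaluated at a new x value of 6; the only difference is the +1 under the square root:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([6.1, 7.8, 11.2, 14.1, 16.8, 20.3, 22.9, 26.4])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

x_new = 6.0
y_hat = b0 + b1 * x_new
extrap = (x_new - x.mean()) ** 2 / Sxx                # extrapolation penalty term

se_mean = s_eps * np.sqrt(1.0 / n + extrap)           # for estimating E(y_{n+1})
se_pred = s_eps * np.sqrt(1.0 + 1.0 / n + extrap)     # for predicting an individual y_{n+1}

print("95% CI for E(y):", (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean))
print("95% PI for y:   ", (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred))
```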

Examining Lack of Fit in Linear Regression (1)

• We have been concerned with how well a linear regression model fits, but only from an intuitive standpoint.

• We could examine a scatterplot of the data to see whether it looked linear, and we could test whether the slope differed from 0; however, we had no way of testing whether a higher-order model would be a more appropriate model for the relationship between y and x.

• Pictures (or graphs) are always a good starting point for examining lack of fit.

– First, use a scatterplot of y versus x.

– Second, a plot of the residuals (yi − ŷi) versus the predicted values ŷi may give an indication of the following problems:

• Outliers or erroneous observations. In examining the residual plot, your eye will naturally be drawn to data points with unusually high residuals.

• Violation of the assumptions. For the model y = β0 + β1x + ε, we have assumed a linear relation between y and the independent variable x, and independent, normally distributed errors with a constant variance.

Examining Lack of Fit in Linear Regression (2)

• The residual plot for a model and data set that has none of these apparent problems would look much like the plot below.

• When a higher-order model is more appropriate, the residual plot looks more like the following plot.

Examining Lack of Fit in Linear Regression (3)

• A check of the constant variance assumption can be addressed in the y versus x scatterplot or with a plot of the residuals (yi − ŷi) versus xi.

• Homogeneous error variance across values of x

• The error variances increase with increasing values of x
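A sketch of these two diagnostic plots on simulated data (assuming numpy and matplotlib); np.polyfit supplies the fitted line:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 60)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)        # polyfit returns the highest-degree coefficient first
fitted = b0 + b1 * x
resid = y - fitted

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid)          # curvature here suggests a higher-order model
axes[0].axhline(0, color="gray")
axes[0].set_xlabel("fitted values")
axes[0].set_ylabel("residuals")
axes[1].scatter(x, resid)               # a funnel shape here suggests non-constant variance
axes[1].axhline(0, color="gray")
axes[1].set_xlabel("x")
axes[1].set_ylabel("residuals")
plt.show()
```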

Examining Lack of Fit in Linear Regression (4)

• When there is more than one observation per level of the independent variable, we can conduct a test for lack of fit of the fitted model by partitioning SS (residuals) into two parts, one pure experimental error and the other lack of fit.

• Let yij denote the response for the jth observation at the ith level of the independent variable. Then, if there are ni observations at the ith level of the independent variable, the quantity

$$\sum_j (y_{ij} - \bar{y}_i)^2$$

provides a measure of what we will call pure experimental error. This sum of squares has ni − 1 degrees of freedom.

• Similarly, for each of the other levels of x, we can compute a sum of squares due to pure experimental error. The pooled sum of squares

$$SSP_{\text{exp}} = \sum_{ij} (y_{ij} - \bar{y}_i)^2$$

Examining Lack of Fit in Linear Regression (5)

called the sum of squares for pure experimental error, has Σi (ni − 1) degrees of freedom. With SSlack representing the remaining portion of SS(Residual), we have

$$SS(\text{Residual}) = SSP_{\text{exp}} + SS_{\text{lack}}$$

where SSPexp is due to pure experimental error and SSlack is due to lack of fit.

• If SS(Residual) is based on n − 2 degrees of freedom in the linear regression model, then SSlack will have df = n − 2 − Σi (ni − 1).

• Under the null hypothesis that our model is correct, we can form independent estimates of σε², the model error variance, by dividing SSPexp and SSlack by their respective degrees of freedom; these estimates are called mean squares and are denoted by MSPexp and MSlack, respectively.

Examining Lack of Fit in Linear Regression (6)

• The test for lack of fit is summarized here.

• Conclusion

– If the F test is significant, this indicates that the linear regression model is inadequate.

– A nonsignificant result indicates that there is insufficient evidence to suggest that the linear regression model is inappropriate.
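A sketch of the lack-of-fit computation on a small hypothetical data set with two replicate observations at each x level; SS(Residual) is split into pure error and lack of fit, and the F statistic compares MSlack to MSPexp:

```python
import numpy as np
from scipy import stats

# Hypothetical data with replicate y observations at each level of x
x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
y = np.array([3.1, 3.4, 5.2, 4.8, 6.9, 7.3, 8.8, 9.4, 11.2, 10.6])
n = x.size

# Fit the simple linear regression
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
ss_residual = np.sum((y - (b0 + b1 * x)) ** 2)

# Pure experimental error: variation of replicates around their level means
levels = np.unique(x)
ssp_exp = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
df_pure = sum((x == lv).sum() - 1 for lv in levels)

ss_lack = ss_residual - ssp_exp
df_lack = (n - 2) - df_pure

F = (ss_lack / df_lack) / (ssp_exp / df_pure)   # MS_lack / MSP_exp
p_value = stats.f.sf(F, df_lack, df_pure)
print(F, p_value)    # a significant F suggests the linear model is inadequate
```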

The Inverse Regression Problem (Calibration) (1)

• In experimental situations, we are often interested in estimating the value of the independent variable corresponding to a measured value of the dependent variable.

• The most commonly used estimate is found by replacing ŷ in the least-squares equation ŷ = β̂0 + β̂1x with the observed y and solving for x:

$$\hat{x} = \frac{y - \hat{\beta}_0}{\hat{\beta}_1}$$

• Two different inverse prediction problems will be discussed here.

– The first is for predicting x corresponding to an observed value of y; the second is for predicting x corresponding to the mean of m > 1 values of y that were obtained independently of the regression data.
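A minimal sketch of this point estimate with hypothetical fitted coefficients (the interval formulas from the later summaries are not reproduced here):

```python
# Fitted calibration line (hypothetical values): y_hat = b0 + b1 * x
b0, b1 = 2.0, 3.0

# Observed response for a new specimen
y_obs = 20.0

# Point estimate of the x value that produced y_obs
x_hat = (y_obs - b0) / b1
print(x_hat)   # 6.0
```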

The Inverse Regression Problem (Calibration) (2)

• Predicting x based on an observed y-value

• The greater the strength of the linear relationship between x and y, the larger the quantity (1 − c²), making the width of the prediction interval narrower.

The Inverse Regression Problem (Calibration) (3)

• Predicting x based on m y-values

Analyzing Data from the E. coli Concentrations Case Study (1)

Analyzing Data from the E. coli Concentrations Case Study (2)

Analyzing Data from the E. coli Concentrations Case Study (3)

Analyzing Data from the E. coli Concentrations Case Study (4)

• The width of the 95% prediction intervals was slightly less than one unit for most values of HEC. Thus, a field HEC determination of an E. coli concentration in the -1 to 2 range would yield a 95% prediction interval of roughly that width for the corresponding HGMF determination.

• This degree of accuracy would not be acceptable. One way to reduce the width of the intervals would be to conduct an expanded study involving considerably more observations than the 17 obtained in this study.

Correlation (1)

• Correlation coefficient

– This proportionate reduction in error is closely related to the correlation coefficient of x and y. A correlation measures the strength of the linear relation between x and y.

– The stronger the correlation, the better x predicts y.

– Given n pairs of observations (xi, yi), we compute the sample correlation r as

$$r_{yx} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{S_{xx} S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

where

$$S_{yy} = \sum_i (y_i - \bar{y})^2 = SS(\text{Total})$$

Correlation (2)

• Coefficient of determination

– Correlation and regression predictability are closely related. The proportionate reduction in error for regression we defined earlier is called the coefficient of determination.

– The coefficient of determination is simply the square of the correlation coefficient,

$$r_{yx}^2 = \frac{SS(\text{Total}) - SS(\text{Residual})}{SS(\text{Total})}$$

which is the proportionate reduction in error.
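A short numpy check that the defining formula for r matches np.corrcoef, and that the SS-based coefficient of determination equals r² (hypothetical data again):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([6.1, 7.8, 11.2, 14.1, 16.8, 20.3, 22.9, 26.4])

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)            # SS(Total)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

r = Sxy / np.sqrt(Sxx * Syy)                 # sample correlation
print(r, np.corrcoef(x, y)[0, 1])            # should agree

# Coefficient of determination via the SS decomposition
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
ss_residual = np.sum((y - (b0 + b1 * x)) ** 2)
r_squared = (Syy - ss_residual) / Syy
print(r_squared, r ** 2)                     # should agree
```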

Correlation (3)

• Assumptions for correlation inference

– The assumptions of regression analysis---linear relation between x and y and constant variance around the regression line, in particular---are also assumed in correlation inference.

– In regression analysis, we regard the x values as predetermined constants.

– In correlation analysis, we regard the x values as randomly selected (and the regression inferences are conditional on the sampled x values).

– If the xs are not drawn randomly, it is possible that the correlation estimates are biased. In some texts, the additional assumption is made that the x values are drawn from a normal population. The inferences we make do not depend crucially on this normality assumption.

– The most basic inference problem is potential bias in estimation of ρyx.

– The choice of x values can systematically increase or decrease the sample correlation.

– In general, a wide range of x values tends to increase the magnitude of the correlation coefficient, and a small range to decrease it.

Correlation (4)

• Correlation coefficients can be affected by systematic choices of x values; the residual standard deviation Sε is not affected systematically, although it may change randomly if part of the x range changes.

• Thus, it is a good idea to consider the residual standard deviation and the magnitude of the slope when you decide how well a linear regression line predicts y.

Summary of a Statistical Test for ρyx

Example 11.16 (1)

Example 11.16 (2)

Example 11.16 (3)

Example 11.16 (4)

• We would reject the null hypothesis at any reasonable α level, so the correlation is “statistically significant.” However, the regression accounts for only 0.035 of the squared error in the dependent variable, so x is almost worthless as a predictor.

• Remember, the rejection of the null hypothesis in a statistical test is the conclusion that the sample results cannot plausibly have occurred by chance if the null hypothesis is true.

• The test itself does not address the practical significance of the result. Clearly, for a sample size of 40,000, even a trivial sample correlation like 0.035 is not likely to occur by mere luck of the draw. There is no practically meaningful relationship between the dependent and independent variables in this example.
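As an illustration of this point, the sketch below simulates a sample of 40,000 pairs with an essentially useless predictor and applies the usual t test of H0: ρ = 0, t = r√(n − 2)/√(1 − r²) with n − 2 df; the data and the coefficient 0.035 are invented for the simulation, not taken from Example 11.16:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 40_000
x = rng.normal(size=n)
y = 0.035 * x + rng.normal(size=n)      # essentially useless predictor

r = np.corrcoef(x, y)[0, 1]
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # t test of H0: rho = 0, n - 2 df
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(r, r ** 2, p_value)   # r^2 is tiny, yet the p-value is essentially 0:
                            # statistically significant but practically worthless
```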