
Simple Linear Regression (OI Chapter 7)

Important Concepts
• Correlation (r or R) and coefficient of determination (R²)
• Interpreting y-intercept and slope coefficients
• Inference (hypothesis testing and confidence intervals) on the y-intercept (β₀), the slope (β₁), and the mean response (μ_y = β₀ + β₁x)
• Prediction for a new value of the response, y = β₀ + β₁x + ε
• Extrapolation = using the model to make predictions for x-values outside the range of observed data
• Checking regression assumptions
• Detecting outliers and influential points: leverage (hat values) and Cook's distance

Example: Manatees

Least-squares regression line

[Scatterplot: Manatee Deaths vs. Powerboat Registrations (in thousands), with the least-squares regression line ŷ = −41.43 + 0.12x]

Linear Regression in R


Estimated regression coefficients
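The R output itself is not reproduced in this transcript, but a fit like this comes from lm(). A minimal sketch, using only the five observations that appear later in these notes (so the toy estimates will not match −41.43 and 0.12 exactly; the variable names are assumptions):

    # Toy refit on 5 of the 14 observations quoted in these notes
    boats  <- c(447, 460, 481, 526, 559)   # registrations (thousands)
    deaths <- c(13, 21, 24, 15, 34)        # manatee deaths

    fit <- lm(deaths ~ boats)   # least-squares fit
    coef(fit)                   # estimated intercept and slope
    summary(fit)                # SEs, t-tests, R-squared, etc.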

Interpret Regression Coefficients

How do we interpret the intercept?
• If a year had no powerboat registrations (x = 0), we would expect about −41 manatee deaths... which is not meaningful.
• This is not a valid prediction: the number of powerboat registrations in the data set ranged only from about 450,000 to 700,000, so predicting at x = 0 extrapolates outside the range of observed data.

ŷ = −41.43 + 0.12x

Interpret Regression Coefficients

How do we interpret the slope?
• For every additional 1,000 powerboat registrations, we estimate that the average number of manatee deaths increases by 0.12.
• That is, for every additional eight thousand powerboat registrations, we predict approximately one (0.12 × 8 ≈ 1) additional manatee death.

ŷ = −41.43 + 0.12x


Predictions

Let x = 500. Then ŷ = −41.43 + 0.12(500) = 18.57.

Interpretation?
1. An estimate of the average number of manatee deaths across all years with 500,000 powerboat registrations is about 19 deaths.
2. A prediction for the number of manatee deaths in one particular year with 500,000 powerboat registrations is about 19 deaths.
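These two readings correspond to two different interval types in R's predict(). A minimal sketch, reusing the toy five-observation subset from the earlier sketch (variable names are assumptions):

    boats  <- c(447, 460, 481, 526, 559)
    deaths <- c(13, 21, 24, 15, 34)
    fit <- lm(deaths ~ boats)

    new <- data.frame(boats = 500)
    predict(fit, new, interval = "confidence")   # interval for the MEAN response
    predict(fit, new, interval = "prediction")   # interval for ONE new year (wider)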

Residuals (Prediction Errors)
• Individual y values can be written as:
  y = predicted value + prediction error
  or y = fitted value + residual
  or y = ŷ + residual
• For each observation, residual = "observed − predicted" = y − ŷ.

Ex: Manatees

ŷ = −41.4304 + 0.1249x

Observation (526, 15):
• Observed response = 15
• Predicted response = −41.4304 + 0.1249(526) = 24.3
• Residual = 15 − 24.3 = −9.3

Observation (559, 34):
• Observed response = 34
• Predicted response = −41.4304 + 0.1249(559) = 28.4
• Residual = 34 − 28.4 = 5.6

Residual = vertical distance from the observed point to the regression line.

[Scatterplot: Manatee Deaths vs. Powerboat Registrations (in thousands), with residuals drawn as vertical distances from the points to the line]
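A quick check of the slide's arithmetic in R, using the full-data line reported above (not a refit):

    yhat <- function(x) -41.4304 + 0.1249 * x
    yhat(526)        # 24.27, so residual = 15 - 24.3 = -9.3
    yhat(559)        # 28.39, so residual = 34 - 28.4 =  5.6
    15 - yhat(526)
    34 - yhat(559)
    # For a fitted lm object, resid(fit) returns all residuals and
    # fitted(fit) the predicted values: resid(fit) equals y - fitted(fit).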

What do we mean by “least squares”?

• Basic Idea: Minimize how far off we are when we use the line to predict y, based on x, by comparing to the actual y.

• Definition: The least squares regression line is the line that minimizes the sum of the squared residuals over all points in the data set.
  The sum of squared "errors" (SSE) is that minimum sum:

  SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σ (residual)²  (summed over all values)
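As a small illustration that the lm() line really does minimize SSE, here is a sketch on the toy five-observation subset used earlier (names are assumptions):

    boats  <- c(447, 460, 481, 526, 559)
    deaths <- c(13, 21, 24, 15, 34)
    fit <- lm(deaths ~ boats)

    sse <- function(b0, b1) sum((deaths - (b0 + b1 * boats))^2)
    sum(resid(fit)^2)                     # SSE at the least-squares line
    sse(coef(fit)[1], coef(fit)[2])       # same value
    sse(coef(fit)[1] + 1, coef(fit)[2])   # any other line gives a larger SSE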

Ex: Manatees

Least squares regression line: ŷ = −41.4304 + 0.1249x

We can compute the residuals for all 14 observations.
• Positive residual => observed value higher than predicted.
• Negative residual => observed value lower than predicted.

  x     y    ŷ                              Residual
  447   13   −41.43 + 0.1249(447) = 14.4    13 − 14.4 = −1.4
  460   21   −41.43 + 0.1249(460) = 16.0    21 − 16.0 = 5.0
  481   24   −41.43 + 0.1249(481) = 18.6    24 − 18.6 = 5.4
  …

The least squares regression line is such that SSE = (−1.4)² + (5.0)² + (5.4)² + … is as small as possible.

Properties of Correlation: r

• The magnitude of r indicates the strength of the linear relationship: how close the points are to the regression line.
  – Values close to 1 or to −1 → strong linear relationship
  – Values close to 0 → no or weak linear relationship
• The sign of r indicates the direction of the linear association: when one variable increases, does the other generally increase (positive association) or generally decrease (negative association)?
  – r > 0 → positive linear association
  – r < 0 → negative linear association


General Guidelines for Describing Strength

  Value of r                      Strength of linear relationship
  −1.0 to −0.5  or  0.5 to 1.0    Strong linear relationship
  −0.5 to −0.3  or  0.3 to 0.5    Moderate linear relationship
  −0.3 to −0.1  or  0.1 to 0.3    Weak linear relationship
  −0.1 to 0.1                     No or very weak linear relationship

The table above serves only as a rule of thumb (many experts may somewhat disagree on the choice of boundaries).

Source: http://www.experiment-resources.com/statistical-correlation.html

Manatees

• r = 0.941
• Very strong positive linear association

[Scatterplot: Manatee Deaths vs. Powerboat Registrations (in thousands)]

Measuring Strength of Linear Association
• The following scatterplots are arranged from the strongest positive linear association (on the left), through virtually no linear association (in the middle), to the strongest negative linear association (on the right).
• Take a couple of minutes and try to guess the correlation for each plot.

Positive ← None → Negative
Strong ← Weak | Weak → Strong

Measuring Strength of Linear Association

Get more practice at guessing the correlation, and impress your friends, here: http://www.rossmanchance.com/applets/GuessCorrelation.html

Correlations for the plots above: .994, .889, .510, −.081, −.450, −.721, −.907

Formula for r (p. 338 footnote)

We will have R calculate r for us! But we can still learn from the formula:

  r = (1 / (n − 1)) Σ [(xᵢ − x̄) / sₓ] [(yᵢ − ȳ) / s_y]

• We are looking at an (almost) average of the products of the standardized (z) scores for x with the respective standardized (z) scores for y. What does this mean? (Draw a picture.)

Formula for r

The formula also shows us that:
• The order of x and y doesn't matter. It doesn't matter which of the two variables is called x and which is called y; the correlation doesn't care.
• Correlation is unitless. It doesn't change when the measurement units are changed, since it uses standardized observations in its calculation.

  r = (1 / (n − 1)) Σ [(xᵢ − x̄) / sₓ] [(yᵢ − ȳ) / s_y]
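A sketch illustrating all three points with toy data (the five manatee observations quoted earlier; names are assumptions): r is the (almost) average product of z-scores, symmetric in x and y, and unitless.

    x <- c(447, 460, 481, 526, 559)
    y <- c(13, 21, 24, 15, 34)
    n <- length(x)

    zx <- (x - mean(x)) / sd(x)   # z-scores for x
    zy <- (y - mean(y)) / sd(y)   # z-scores for y
    sum(zx * zy) / (n - 1)        # the formula above
    cor(x, y)                     # R's built-in correlation: same value
    cor(y, x)                     # order of x and y doesn't matter
    cor(1000 * x, y)              # changing units doesn't change r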


x = Month (January = 1, February = 2, etc.)
y = Raleigh's average monthly temperature

r = .257

Even though the relationship between month and temperature is very strong, it is curved (seasonal) rather than linear, so the correlation is only 0.257, indicating a weak linear relationship.

Warning! The correlation coefficient only measures the strength and direction of a linear association.

R-squared (coefficient of determination): r²

The squared correlation r² is between 0 and 1 and indicates the proportion of variation in the response (y) "explained" by knowing x.

SSTO = sum of squares total = the sum of squared differences between the observed y values and ȳ.

We will break SSTO into two pieces, SSE + SSR:
• SSE = sum of squared residuals (error): the unexplained part.
• SSR = sum of squares due to regression (the explained part) = the sum of squared differences between the fitted values ŷ and ȳ.

New interpretation: r²

SSTO = SSR + SSE

Question: How much of the total variability in the y values (SSTO) is "explained" by the regression model (SSR)? How much better can we predict y when we know x than when we don't?

  r² = SSR / (SSR + SSE) = SSR / SSTO = 1 − SSE / SSTO

Example: Chug Times

x = body weight (lbs); y = time to chug a 12-oz drink (sec)
• Total variation summed over all points: SSTO = 36.6
• Unexplained part summed over all points: SSE = 13.9
• Explained by knowing x: SSR = 22.7

[Scatterplot of ChugTime vs. Weight, with the regression line and a horizontal line at the mean chug time, ȳ = 5.108]

  r² = 1 − SSE/SSTO = SSR/SSTO = 22.7/36.6 = 0.62

Interpretation: 62% of the variability in chug times is explained by knowing the weight of the person.

Breakdown of Calculations

ȳ = 66.4 / 13 = 5.11        ŷ = 13.298 − 0.046x

  x     y     (y − ȳ)²             ŷ      (ŷ − ȳ)²              (y − ŷ)²
  153   5.6   (5.6−5.11)² = 0.24   6.29   (6.29−5.11)² = 1.40   (5.6−6.29)² = 0.48
  169   6.1   (6.1−5.11)² = 0.98   5.56   (5.56−5.11)² = 0.20   (6.1−5.56)² = 0.29
  178   3.3   (3.3−5.11)² = 3.27   5.15   (5.15−5.11)² = 0.002  (3.3−5.15)² = 3.41
  …     …     …                    …      …                     …
  158   6.7   (6.7−5.11)² = 2.53   6.06   (6.06−5.11)² = 0.91   (6.7−6.06)² = 0.41
  SUM:  66.4  SSTO = 36.61         66.4   SSR = 22.73           SSE = 13.88

SSTO = SSR + SSE  =>  36.61 = 22.73 + 13.88

Note: SSTO has nothing to do with the fitted model or with x; it is just a measure of the variability in y (it is used in calculating the standard deviation of the y values).
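A sketch reproducing this decomposition in R. The chug-time data are not given in full above, so R's built-in cars data stand in:

    fit  <- lm(dist ~ speed, data = cars)
    y    <- cars$dist
    ssto <- sum((y - mean(y))^2)               # total variation
    ssr  <- sum((fitted(fit) - mean(y))^2)     # explained by the model
    sse  <- sum(resid(fit)^2)                  # left unexplained
    c(SSTO = ssto, SSR_plus_SSE = ssr + sse)   # equal, as claimed
    ssr / ssto                                 # r-squared from the decomposition
    summary(fit)$r.squared                     # matches R's reported R-squared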

Example: Manatees

• r = 0.94 → r² = 0.89
• Interpretation: About 89% of the variability in the number of manatee deaths (the response variable) can be explained by the number of powerboat registrations (the explanatory variable).
• r² is also reported in the R output (the "Multiple R-squared" line from summary()).


REGRESSION DIAGNOSTICS

What to check for?
• Simple linear regression model assumptions:
  1. Linearity
  2. Constant variance
  3. Normality
  4. Independence
• Outliers? Influential points?
• Other predictors?

Scatterplots with a lowess curve

The best plot to start with is a scatterplot of Y vs. X.
• Add the least-squares regression line.
• Add a smoothed curve (not restricted to a line; it follows the general pattern of the data).
• lowess curve = "locally weighted regression scatterplot smoothing"
  • Similar to a "moving average": it fits least-squares lines around each "neighborhood" of points, makes predictions, and smooths the predictions.
  • R function: lines(lowess(y ~ x)); see the sketch below.
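A minimal sketch of this recipe, with the built-in cars data standing in for the manatee data (which is not reproduced in these notes):

    fit <- lm(dist ~ speed, data = cars)
    plot(dist ~ speed, data = cars)                # scatterplot of Y vs. X
    abline(fit)                                    # least-squares line
    lines(lowess(cars$speed, cars$dist), lty = 2)  # smoothed lowess curve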

Residuals vs. Fitted Values (or residuals vs. X)

Check the linearity and constant variance assumptions, and look for outliers.
• Good:
  • No pattern; random scatter.
  • Equal spread around the horizontal line at zero (the mean of the residuals).
• Bad:
  • A pattern (curved, etc.) → indicates linearity is not met.
  • Funneling (spread increases or decreases with X) → indicates non-constant variance.

Residuals vs. Fitted Values - GOOD

[Residual plot] The points look randomly scattered around zero; there is no evidence of a nonlinear pattern or of unequal variances.

Residuals vs. Fitted Values - BAD

[Two residual plots] The plot on the left shows evidence of non-constant variance; the plot on the right shows evidence of a nonlinear relationship. A sketch of how to draw such a plot follows.
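A minimal sketch (cars data as a stand-in); look for random scatter with even spread around the zero line:

    fit <- lm(dist ~ speed, data = cars)
    plot(fitted(fit), resid(fit),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)   # residuals should scatter evenly around zero
    # plot(fit, which = 1) is R's built-in version, with a smoother added.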

Page 6: Simple Linear Regression %=! +! (+) · PDF file · 2017-11-16Simple Linear Regression OI CHAPTER 7 Important Concepts ... Copyright ©2004 Brooks/Cole, a division of Thomson Learning,

Residuals vs. Time or Observation Number
• If the times at which the observations were collected are available, a plot of residuals vs. time can help us check the independence assumption.
• Good: no pattern over time.
• Bad: a pattern over time (e.g., increasing, decreasing, or cyclical) → indicates dependence among the data points (points closer together in time are more similar than points further apart in time); see the sketch below.
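A minimal sketch, plotting the residuals in collection order (this assumes the rows of the stand-in cars data are already in time order):

    fit <- lm(dist ~ speed, data = cars)
    plot(resid(fit), type = "b",
         xlab = "Observation number", ylab = "Residual")
    abline(h = 0, lty = 2)   # a drift or cycle here suggests dependent errors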

Independence Assumption
• In general, the independence assumption is checked through knowledge of how the data were collected.
• Example: Suppose we want to predict blood pressure from the amount of fat consumed in the diet.
  • Incorrect sampling → sampling entire families → results in dependent observations. Why? (Members of the same family tend to share diet and other factors, so their observations are more alike than those of unrelated people.)
  • Use a random sample to ensure independence of observations.

Residuals vs. Other Predictor
• Check whether another predictor helps explain the leftovers (residuals) from our model.
• If there is a pattern (e.g., the residuals increase with values of the other predictor):
  • it may indicate that this other predictor helps predict Y in addition to the original X;
  • try adding it to the model (multiple linear regression, introduced in Chapter 6).

Normal Probability Plot (Normal Quantile-Quantile Plot) of Residuals
• Plots the sample quantiles of the residuals (y-axis) vs. the quantiles we would expect from a standard normal distribution (x-axis).
• The best plot to use for checking the normality assumption.
• Good: the points follow a straight line.
• Bad: the points deviate in a systematic fashion from a straight line → indicates a violation of the normality assumption.

Normal Probability Plots of Residuals

[Examples of bad and good normal probability plots; see also Figure 3.9 on p. 112.] A sketch of how to draw one follows.
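A minimal sketch (cars data as a stand-in):

    fit <- lm(dist ~ speed, data = cars)
    qqnorm(resid(fit))   # sample quantiles vs. standard normal quantiles
    qqline(resid(fit))   # points should track this reference line closely
    # plot(fit, which = 2) gives the analogous plot for standardized residuals.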

What to do if assumptions are not met?

Two choices:
1. Abandon the simple linear regression model and use a more appropriate model (future courses).
2. Employ some transformation of the data so that the simple linear regression model is appropriate for the transformed data.
   • Can transform X or Y or both.
   • Make sure to run regression diagnostics on the model fit to the transformed data to check the assumptions.
   • Transformations can make interpretations more complex.

Page 7: Simple Linear Regression %=! +! (+) · PDF file · 2017-11-16Simple Linear Regression OI CHAPTER 7 Important Concepts ... Copyright ©2004 Brooks/Cole, a division of Thomson Learning,

Transformations
• Nonlinear pattern with constant error variance → try transforming X.
• Residuals form an inverted U → use X′ = √X or X′ = log(X). (The upper plots on the slide show the data and residual plot before the transformation; the lower plots show them after.)
• Residuals are U-shaped and the association between X and Y is positive → use X′ = X² or X′ = exp(X).
• Residuals are U-shaped and the association between X and Y is negative → use X′ = 1/X or X′ = exp(−X).

Transformations
• Non-constant error variance → try transforming Y. It may also help to transform X in addition to Y. See the sketch below.
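In R, a transformation is just a change to the model formula; which one to try follows the guidelines above (cars data as a stand-in):

    fit_logx  <- lm(dist ~ log(speed), data = cars)   # transform X
    fit_sqrty <- lm(sqrt(dist) ~ speed, data = cars)  # transform Y
    # Re-run the diagnostics (residual and Q-Q plots) on the new fits.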

Box-Cox Transformations
• "Power transformation" on Y → transformed Y′ = Y^λ.
• Most commonly used:
  λ = 2    →  Y′ = Y²
  λ = 0.5  →  Y′ = √Y
  λ = 0    →  Y′ = log(Y)  (by definition)
  λ = −0.5 →  Y′ = 1/√Y
  λ = −1   →  Y′ = 1/Y
• Remember: the notation "log" means log base e.

Page 8: Simple Linear Regression %=! +! (+) · PDF file · 2017-11-16Simple Linear Regression OI CHAPTER 7 Important Concepts ... Copyright ©2004 Brooks/Cole, a division of Thomson Learning,

Box-Cox Transformations
• Find the maximum likelihood estimate of λ.
• R function to find the best λ: boxcox (need to load the MASS library).
• It plots the likelihood function; look at the value of λ where the likelihood is highest.
• It is best to choose a value of λ that is easy to interpret (e.g., choose λ = 0 rather than λ = 0.12); see the sketch below.
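A minimal sketch (MASS must be installed; the response must be positive; cars data as a stand-in):

    library(MASS)
    fit <- lm(dist ~ speed, data = cars)
    boxcox(fit, lambda = seq(-2, 2, 0.1))   # plots the log-likelihood over lambda
    # Read off the lambda near the peak and round to an interpretable
    # value, e.g. 0.5 for a square-root transformation of Y.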

Detecting Outliers and Influential Points
• Outlier = an observation that does not follow the overall pattern of the data set. Outliers are easy to see in scatterplots for SLR but harder to find in multiple linear regression.
• Influential point = an observation that, when removed, changes the model fit substantially (e.g., the coefficient estimates or the correlation).
• Tools for detecting outliers and influential points (see the sketch below):
  • Residual plots.
  • Leverage = measures how far an observation is from the mean of the predictors. High leverage = the potential to be an influential point.
  • Cook's distance = measures how much influence an individual observation has on the fitted coefficients (how much they change when the observation is removed). High Cook's distance = influential point.
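A minimal sketch of these diagnostics in R (cars data as a stand-in):

    fit <- lm(dist ~ speed, data = cars)
    hatvalues(fit)        # leverage: large values are far from the mean of x
    cooks.distance(fit)   # influence of each observation on the coefficients
    # plot(fit, which = 5) shows residuals vs. leverage with Cook's
    # distance contours, flagging potentially influential points.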