The bivariate normal assumption and diagnostic plots: residuals and Cook's distance
Stat 302, Winter 2016, SFU, Week 2, Hour 3

Page 1

Today's Agenda

r2, the coefficient of determination

The bivariate normal assumption

Diagnostic plots: Residuals and Cook's Distance

R output (moved to Week 3)

Syllabus note: We are ahead of schedule in regression, so we're taking the time to add more examples and details, like Cook's distance and residuals.

Page 2

r2, the coefficient of determination

r2 is simply the Pearson correlation coefficient r, but squared.

So why all the fuss about it?

When x and y are correlated, we say that some of the variation in y is explained by x.

The proportion of variation explained is r2. It is called the coefficient of determination because it represents how well a value of y can be determined by x.
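Since the R output is coming in Week 3, here is only a minimal preview sketch, using invented data, of how r, r2, and the regression output line up in R:

set.seed(1)
x <- rnorm(50, mean = 70, sd = 10)      # invented predictor values
y <- 20 + 0.1 * x + rnorm(50, sd = 2)   # invented response, linearly related to x plus noise

r <- cor(x, y)            # Pearson correlation coefficient
r^2                       # coefficient of determination

fit <- lm(y ~ x)
summary(fit)$r.squared    # the same number, as reported by the regression fit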

Page 3

Abstract case 1:

If there were a perfect correlation between x and y (r = -1 or +1), then the relationship between them could be described perfectly by a line.

Once you have the regression equation, knowing x allows you to determine exactly what y is, without any error.

In these cases, r2 is 1, meaning that 100% of the variance in y is explained by x.
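A tiny R illustration with invented numbers: when y is an exact linear function of x, the correlation is ±1 and r2 is 1.

x <- 1:10
y <- 3 - 2 * x       # y is an exact linear function of x
cor(x, y)            # -1 (perfect negative correlation)
cor(x, y)^2          # 1: all of the variance in y is explained by x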

Page 4

Abstract case 2:

If there were NO correlation between x and y, such that r = 0, then there is no linear relationship between x and y.

Knowing x and using the regression equation of that (lack of) relationship would tell you literally nothing about y.

In these cases, r2 is 0, so none of the variance in y is explained by x.

Page 5

Medical example:

On page 4 of 8 of this paper, Pak J Physiol 2010;6(1):

http://www.pps.org.pk/PJP/6-1/Talay.pdf

... there are several scatterplots describing the correlation between resting heart rate (RHR) and several other possibly related variables.

Consider the first scatterplot, called Figure 1A. In this figure, a regression of body-mass index (BMI, y) as a function of resting heart rate (RHR, x) is shown.

Page 6

Scatterplot of Heart Rate (x) and Body-Mass Index (y)

Page 7

Here, the sample correlation is r = 0.305, and there is strong evidence that the population correlation is positive because p < 0.01.

r² = 0.305² = 0.0930,

so 9.3% of the variation in BMI can be explained by RHR.

Also, 9.3% of the variation in RHR can be explained by BMI.

Why?

Page 8

Correlation works in both directions.

In Figure 1b, the sample shows that some variation in Waist-to-Hip Ratio (WHR) is explained by (and explains) RHR.

0.230² = 0.0529, or 5.3% of the variation.
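Because correlation is symmetric, the same r2 comes out no matter which variable plays the role of x. A quick check in R, with invented numbers:

x <- c(60, 65, 72, 80, 88)        # invented resting heart rates
y <- c(21, 24, 23, 27, 30)        # invented BMI values
all.equal(cor(x, y), cor(y, x))   # TRUE: the correlation is the same in both directions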

Page 9

If Body-Mass Index explains 9.3% of the variation of RHR, and

Waist-to-Hip Ratio explains 5.3% of the variation,

could they together explain 9.3 + 5.3 = 14.6%?

Sadly, no.

Since BMI and WHR are measuring very similar things, there is going to be a lot of overlap in the variation that they explain.
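A hedged illustration in R with simulated data (all names and numbers invented): when two strongly correlated predictors go into one model, the combined R-squared typically comes out well below the sum of the individual r2 values, because they explain overlapping variation.

set.seed(2)
n   <- 100
bmi <- rnorm(n, mean = 25, sd = 3)
whr <- 0.02 * bmi + rnorm(n, sd = 0.03)    # WHR made to track BMI closely
rhr <- 40 + 1.2 * bmi + rnorm(n, sd = 5)   # RHR related to BMI (and hence to WHR)

summary(lm(rhr ~ bmi))$r.squared           # variation explained by BMI alone
summary(lm(rhr ~ whr))$r.squared           # variation explained by WHR alone
summary(lm(rhr ~ bmi + whr))$r.squared     # combined model: well below the sum of the two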

Page 10

But what is this 'variation'? Let's dig deeper!

Page 11

Recall that the regression equation without the error term, α + βx, is called the least squares line.

Page 12

The 'squares' being referred to are the squared errors.

Mathematically, it is the line through the data that produces the smallest sum of squared errors (SSE):

SSE = Σ ε_i² = Σ (y_i - α - βx_i)²

where epsilon, ε, is the error term that we ignored earlier.
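In R, the residuals from an lm() fit estimate these errors, and SSE is just the sum of their squares. A minimal sketch with invented data:

set.seed(3)
x <- rnorm(40)                 # invented predictor
y <- 1 + 2 * x + rnorm(40)     # invented response with noise
fit <- lm(y ~ x)               # the least squares line
e   <- resid(fit)              # estimated errors (residuals)
sse <- sum(e^2)                # sum of squared errors the line leaves behind
sse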

Page 13

The sum of squares error SSE is the amount of variation that is left unexplained by the model.

We used squared errors because...

- Otherwise negative and positive errors would cancel.

- This way, the regression equation will favour creating many small errors instead of one big one.*

- In calculus, the derivative of x² is easy to find.

* Also why Pearson correlation is sensitive to extreme values.

Page 14

The error term is in any model we use, even the null model, which is a fancy term for not regressing at all:

y = α + ε

or

y = ȳ + ε

In the null model, every value of y is predicted to be the average of all observed y values. So α is the sample mean of y, y-bar.
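A minimal sketch of the null model in R (invented data): lm(y ~ 1) fits an intercept-only model, and every fitted value is the sample mean of y.

set.seed(4)
y <- rnorm(30, mean = 10, sd = 2)   # invented response; no predictor is used
null_fit <- lm(y ~ 1)               # the null model: intercept only
coef(null_fit)                      # the fitted intercept equals mean(y)
all.equal(unname(coef(null_fit)), mean(y))   # TRUE
range(fitted(null_fit))             # every prediction is the same number, y-bar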

Page 15

The total squared difference from the mean of y is called the sum of squares total, or SST:

SST = Σ (y_i - ȳ)²

SST is the total squared length of all the vertical red lines (the distances from each point to the horizontal line at ȳ).

Page 16

If we fit a regression line, (most of the) errors become smaller.

Page 17

Most importantly, the squared errors get smaller. The coefficient of determination, r2, is measuring how much smaller the squared errors get.

Page 18

Here, the correlation is very strong (r is large), and there are barely any errors at all.

So SSError would be much smaller than SSTotal,

and r2 is also large.

Page 19

The relationship between r2, SSE, and SST is:

r2 = (SST - SSE)/SST = 1 - SSE/SST

SST is the total amount of variation in Y.

SSE is the amount of variation in Y left unexplained by X.

When r2 is zero, SSE is the same as SST.

When r2 is one, SSE disappears completely.
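Here is a minimal R check of that relationship, r2 = 1 - SSE/SST, using invented data; all three quantities below come out to the same value.

set.seed(5)
x <- rnorm(60)
y <- 4 + 3 * x + rnorm(60, sd = 2)
fit <- lm(y ~ x)

sse <- sum(resid(fit)^2)        # variation left unexplained
sst <- sum((y - mean(y))^2)     # total variation in y

1 - sse/sst                     # proportion of SST removed from the error term
cor(x, y)^2                     # the squared correlation
summary(fit)$r.squared          # R's reported r-squared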

Page 20

So we now have two different interpretations of r-squared.

1. The square of the correlation coefficient.

2. The proportion of Sum of Squares Total (SST) that is removed from the error term.

Interpretation #1 is specific to correlation.

Interpretation #2 works for simple regression, but also for ANOVA, multiple regression, and general linear models!

Page 21

R-squared is truly the go-anywhere animal.

Page 22

Bivariate Normality (and some diagnostics)

Regression produces a line that minimizes the sum of squared errors, so a small number of extreme values (outliers) can have a strong effect on a model.

Consider this Pearson r:

Page 23

More specifically, regression is sensitive to violations of the assumption of bivariate normality.

The regression model assumes:

1. The distributions of the x and y variables are normal.

If you were to take a histogram of all the x values, that histogram should resemble a normal curve.

Page 24

The regression model also assumes:

2. The distribution of y, conditional on x, is normal.

If you were to take a histogram of all the error terms, that histogram should ALSO resemble a normal curve.

Any observations that produce errors that are too large to be in the curve are potentially influential outliers.
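One informal way to eyeball both assumptions in R, sketched with invented data (the course's R diagnostics come next week): histogram the x values for assumption 1 and histogram the residuals for assumption 2.

set.seed(6)
x <- rnorm(80, mean = 50, sd = 8)       # invented predictor
y <- 10 + 0.5 * x + rnorm(80, sd = 3)   # invented response
fit <- lm(y ~ x)

hist(x, main = "Histogram of x")                     # should look roughly normal
hist(resid(fit), main = "Histogram of residuals")    # should also look roughly normal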

Page 25

In this diagram, the red line is the regression on all 54 points.

The blue line is the regression without the 4 red points.

Page 26

These points are near the lower end of x, and have very large error terms associated with them, so they 'pull' the left end of the regression line down.

Page 27

Another word for these errors is residuals: literally the residue, or portion left over from the model. Here is a scatterplot of the residuals over x, a.k.a. a residual plot.
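A minimal residual-plot sketch in R, with invented data: plot the residuals against x and add a horizontal line at zero; points far from that line are candidate outliers.

set.seed(7)
x <- rnorm(54)                   # invented predictor (54 points, echoing the example in these slides)
y <- 2 + 1.5 * x + rnorm(54)     # invented response
fit <- lm(y ~ x)

plot(x, resid(fit),
     xlab = "x", ylab = "Residual",
     main = "Residual plot")
abline(h = 0, lty = 2)           # reference line at zero error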

Page 28

The outliers are clearly visible in the residual plot, and in the histogram below it. Their residuals are twice as large as those of any other observation.

Page 29

One way to measure how much an outlier is affecting the model is to remove that one point and see how much the model changes.

We can see a big difference between the blue and red lines above, but that is a comparison by removing 4 points manually.

Another, more systematic (and therefore quick, easy, and often more reliable) method is to remove one observation at a time and see how much the model changes.
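Here is a hedged R sketch of that one-observation-at-a-time idea, using invented data with one planted problem point: refit the model with each observation deleted in turn and record how much the slope moves. This is exactly the logic that Cook's distance (next slide) packages up.

set.seed(8)
x <- rnorm(54)
x[1] <- 3                                # give one point high leverage in x
y <- 1 + 2 * x + rnorm(54)
y[1] <- y[1] - 10                        # ... and a very large error
fit_all <- lm(y ~ x)

slope_change <- sapply(seq_along(y), function(i) {
  fit_i <- lm(y[-i] ~ x[-i])             # refit with observation i deleted
  coef(fit_i)[2] - coef(fit_all)[2]      # how much the slope moves
})
which.max(abs(slope_change))             # the planted point (observation 1) should stand out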

Page 30

Cook's distance is a regression deletion diagnostic. It works by comparing a model with every observation to one with only the observation in question removed / deleted.

The higher Cook's distance is for a value, the more that particular value is influencing the model.

If there are one or two values that are having undue leverage on the model, Cook's distance will find them. This is true even if the residual plot fails to find them (which can happen if the observation is 'pulling' the line hard enough).
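In R, cooks.distance() computes this deletion diagnostic for every observation in an lm() fit. A minimal sketch with invented data; the 4/n cutoff at the end is just one common rule of thumb, not part of the definition.

set.seed(9)
x <- rnorm(54)
y <- 1 + 2 * x + rnorm(54)
fit <- lm(y ~ x)

d <- cooks.distance(fit)                 # one distance per observation
plot(d, type = "h", ylab = "Cook's distance")
which(d > 4 / length(y))                 # flag unusually influential points (rule of thumb)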

Page 31

This is Cook's Distance for all 54 data points.

Note that although all 4 problem points have high Cook's distance compared to the rest, two of them are not obvious problems. Cook's distance has a hard time identifying influential observations when there are several.

Page 32

Dealing with outliers is like selecting an acceptable Type I error rate: there are conventions and guidelines in place, but it is a case-by-case judgement call.

One question to ask, considering things other than your model, is: "does this observation belong in my data set?"

Page 33

If the outlier is the result of a typo, it's not the same as the rest of your sample and it should go.

If other information about that observation is nonsense, such as joke answers in a survey, then that's also justification to remove that outlier observation.

If it just happens to be an extreme value, but otherwise everything seems fine with it, then it is best to keep it.

Page 34

Don't rush to finish your model. Look for outliers first.

Page 35

Next Tuesday:

- Diagnostics and regression in R
- Correlation vs. causality

Read: Rubin on causality, Sections 1-3 only, for next Tuesday.

Sources:
- xkcd.com/605, "My Hobby: Extrapolating"
- Sand crab photo by Regiane Cardillo, Brazil
- http://www.pps.org.pk/PJP/6-1/Talay.pdf, Pak J Physiol (2010) 6(1)
- Mandarin duck and parrot on tortoise: unknown
