Chapter 3: Scatterplots and Correlation AP STATISTICS.

56
Chapter 3: Scatterplots and Correlation AP STATISTICS

Transcript of Chapter 3: Scatterplots and Correlation AP STATISTICS.

Page 1: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Chapter 3:Scatterplots and Correlation

AP STATISTICS

Page 2: Chapter 3: Scatterplots and Correlation AP STATISTICS.
Page 3: Chapter 3: Scatterplots and Correlation AP STATISTICS.
Page 4: Chapter 3: Scatterplots and Correlation AP STATISTICS.
Page 5: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Size and abundance of carnivores

Page 6: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Basically r tells us the strength and direction of two variables

Page 7: Chapter 3: Scatterplots and Correlation AP STATISTICS.

•1. –1 r 1

•2. Value of r does not change if all values of either variable are converted to a different scale

•3. Interchanging all x and y values will not change r

•4. r measures strength of a linear relationship

•5. No linear relationship does not imply no relationship at all. There is a possibility of a non-linear relationship.

Properties of the Linear Correlation Coefficient r

Page 8: Chapter 3: Scatterplots and Correlation AP STATISTICS.

non-linear relationship

0

50

100

150

200

250

0 1 2 3 4 5 6 7 8

Distance

(feet)

Time (seconds)

Title

Page 9: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Correlation tells us about

strength (scatter) and direction of

the linear relationship between

two quantitative variables.

R does not change when we

change units of measurement.

R is not resistant and is strongly

affected by outliers

In addition, we would like to have a numerical description of how both

variables vary together. For instance, is one variable increasing faster

than the other one? And we would like to make predictions based on that

numerical description.

Abundance of carnivores

Page 10: Chapter 3: Scatterplots and Correlation AP STATISTICS.
Page 11: Chapter 3: Scatterplots and Correlation AP STATISTICS.

•r = 1

A perfect straight line

tilting up to the right

•r = 0

No overall tilt

No relationship?

Not so no linear one

•r = – 1

A perfect straight line tilting down to the right

Interpreting Correlation

X

Y

X

Y

X

Y

X

Y

X

Y

X

Y

Page 12: Chapter 3: Scatterplots and Correlation AP STATISTICS.

No Correlation Nonlinear Correlation

r = – 0.471 r = 0.089 r = 0.395

Page 13: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Examples

0

30

60

90

0 100 200Pages per person

Min

utes

per

per

son

eBayYahoo!

MSN

0 100 200Pages per person

Yahoo!

Internet Site Ratings

–Correlation is r = 0.964

•Very strong positive association

•(since r is close to 1)

–Linear relationship

•Straight line with scatter

–Increasing relationship

•Tilts up and to the right positive

5.0%

5.5%

6.0%

0% 1% 2% 3% 4%Loan fee

Inte

rest

rat

e

Mortgage Rates & Fees –Correlation is r = – 0.890

•Strong negative association

–Linear relationship

•Straight line with scatter

–Decreasing relationship

•Tilts down and to the right

Page 14: Chapter 3: Scatterplots and Correlation AP STATISTICS.

-3%

-2%

-1%

0%

1%

2%

3%

-3% -2% -1% 0% 1% 2% 3%

Yesterday's change

Toda

y's

chan

ge

The Stock Market

Example

–Is there momentum?

•If the market was up yesterday, is it more likely to be up today?

•Or is each day’s performance independent?

–Correlation is r = 0.11

•A weak relationship?

–No relationship?

•Tilt is neither

up nor down

Page 15: Chapter 3: Scatterplots and Correlation AP STATISTICS.

120

130

140

150

160

500 600 700 800 900

Temperature

Yie

ld o

f pr

oces

s

Maximizing Yield

Example–A nonlinear relationship

•Not a straight line: A curved relationship

–Correlation r = – 0.0155

•r suggests no relationship but in reality there is a relationship it is just NOT linear

–But relationship is strong

•It tilts neither up nor down

Page 16: Chapter 3: Scatterplots and Correlation AP STATISTICS.

unequal variability

Example

0

1,000

2,000

0 1,000 2,000Investment($millions)

Cir

cuit

mil

es(m

illi

ons)

Correlation r = 0.820

15

20

15 20Log of investment

Log

of

mil

es

Variability is stabilized by taking logarithms (lower right

investmentinvestment

Page 17: Chapter 3: Scatterplots and Correlation AP STATISTICS.

All the same R value

However, making the scatterplots shows us that the correlation/ regression analysis is not appropriate for all data sets.

Moderate linear

association;

regression OK.

Obvious nonlinear

relationship;

regression

inappropriate.

One point deviates from the (highly linear) pattern of the other points; it requires examination before a regression can be done

Just one very

influential point and

a series of other

points all with the

same x value; a

redesign is due

here…

Page 18: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Always plot your data!The correlations all give r ≈ 0.816, and the regression lines are all approximately = 3 + 0.5x. For all four sets, we would predict = 8 when x = 10. The previous slides data.

Page 19: Chapter 3: Scatterplots and Correlation AP STATISTICS.

We want the line as close as possible to the points.

Page 20: Chapter 3: Scatterplots and Correlation AP STATISTICS.

So, we need the line that will minimize the sum of the squares of the vertical distances.

Page 21: Chapter 3: Scatterplots and Correlation AP STATISTICS.

The regression lineThe least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible.

Distances between the points and line are squared so all are positive values. This is done so that distances can be properly added

BAC levels

Page 22: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Properties LSRLThe least-squares regression line can be shown to have this equation:

ˆ y (y rx sy

sx

) rsy

sx

x , or ˆ y a bx

ˆ y is the predicted y value (y hat)

b is the slope

a is the y-intercept

"a" is in units of y

"b" is in units of y/units of x

Page 23: Chapter 3: Scatterplots and Correlation AP STATISTICS.

How to:

First we calculate the slope of the line, b, from statistics we already know:

r is the correlation

sy is the standard deviation of the response variable y

sx is the the standard deviation of the explanatory variable x

b rsy

sx

Once we know b, the slope, we can calculate a, the y-intercept:

a y bx where x and y are the sample means of the x and y variables

This means that we don’t have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on the equation.

Page 24: Chapter 3: Scatterplots and Correlation AP STATISTICS.

BEWARE !!!Not all calculators and software use the same convention:

We use this one:

Ti-89 and Ti-83

give both

Some use instead:

ˆ y a bx

ˆ y ax b

Make sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.

Page 25: Chapter 3: Scatterplots and Correlation AP STATISTICS.

The y-interceptSometimes the y-intercept is not biologically possible. Here we have

negative blood alcohol content, which makes no sense…

But the negative value is

appropriate for the equation of

the regression line.

There is a lot of scatter in the

data and the line is just an

estimate.

BAC

Page 26: Chapter 3: Scatterplots and Correlation AP STATISTICS.

The equation completely describes the regression line.

To plot the regression line, you only need to plug two x values into the equation, get y, and draw the line that goes through those two points.

Hint: The regression line always passes through the mean of x and y

Thus: The points you use for drawing the regression line are derived from the equation.

They are NOT points from your

sample data (except by pure

coincidence).

Regression examines the distance of all points from the line in the y direction only

( , )x y

Page 27: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Correlation and regression

The correlation is a

measure of spread

(scatter) in both the x and

y directions in the linear

relationship

In regression we examine the variation in the response variable (y) given change in the explanatory variable (x)

Page 28: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Equations

The general form for a regression line is

Page 29: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Coefficient of determination, r2

•It measures the fraction (or percent) of the variation in y that is explained by the least squares regression line of y on x. (Memorize this sentence!)

•Basically, this helps us interpret r in a more understandable way since r2 is a percent.

Page 30: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Coefficient of determination, r2

r2, the coefficient of determination, is the square of the correlation coefficient r

r2 represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.

Page 31: Chapter 3: Scatterplots and Correlation AP STATISTICS.

r = 0r2 = 0

Changes in x

explain 0% of the

variations in y.

The value(s) y

takes is (are)

entirely

independent of

what value x

takes.

Changes in x

explain 100% of the

variations in y.

y can be entirely

predicted for any

given value of x.

Here the change in x only

explains 76% of the change

in y. The rest of the change

in y (the vertical scatter,

shown as red arrows) must

be explained by something

other than x.

r = 0.87r2 = 0.76

Page 32: Chapter 3: Scatterplots and Correlation AP STATISTICS.

r =0.9r2 =0.81

There is quite some variation in BAC for the same number of beers drunk. A person’s blood volume is a factor in the equation that was overlooked here.

We changed the number of beers to the number of beers/weight of a person in pounds.

In the first plot, number of beers only explains 49% of the variation in blood alcohol content.

But number of beers/weight explains 81% of the variation in blood alcohol content.

Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).

Page 33: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Grade performance

If class attendance explains 16% of the variation in grades, what is the correlation between percent of classes attended and grade?

1. We need to make an assumption: Attendance and grades are positively correlated. So r will be positive too.

2. r2 = 0.16, so r = +√0.16 = + 0.4

A weak correlation.

Page 34: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Residuals

Points above the line have a positive residual.

Points below the line have a negative residual

Predicted y

Observed y residual )ˆ( dist. yy

The distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter.

These distances are called “residuals.” The sum of these residuals is always 0

Page 35: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Residual plotsResiduals are the distances between y-observed and y-predicted. We plot them in a residual plot.

If residuals are scattered randomly around 0, chances are your data

fit a linear model, were normally distributed, and you didn’t have outliers

Page 36: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Only the y-axis is different

The x-axis in a residual plot is the same as on the scatterplot.

The line on both plots is the regression line.

Page 37: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Residuals are randomly scattered—good!

A curved pattern—means the relationship

you are looking at is not linear but does

not mean there is no relationship.

A change in variability across plot is a

warning sign. You need to find out why it

is and remember that predictions made in

areas of larger variability will not be as

good.

Page 38: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Outliers and influential pointsOutlier: An observation that lies outside the overall pattern of observations.

“Influential individual”: An observation that markedly changes the regression if removed. This is often an outlier on the x-axis.

Child 19 = outlier in y direction

Child 18 = outlier in x direction

Child 19 is an outlier of the relationship.

Child 18 is only an outlier in the x

direction and thus might be an

influential point.

Page 39: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Example: Cost and Quantity

Cost vs. Number Produced

An outlier is visible

•A disaster (a fire at the factory)

•High cost, but few produced

0

10,000

0 20 40 60Number produced

Cos

t

3,000

4,000

5,000

20 30 40 50Number produced

Cos

t

Cost vs. Number Produced

outlier Outlier removed

Page 40: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Are these

points

influential?

Influential

Outlier in y-direction

Page 41: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Observations with Observations with largelarge (in absolute (in absolute value) value) residualsresiduals..

Observations falling Observations falling f a rf a r from the from the regression line while not following the regression line while not following the patternpattern of the of the relationshiprelationship apparent in the apparent in the othersothers

Residual=actual-fitted Residual = y-Residual=actual-fitted Residual = y-y(hat)y(hat)

OUTLIERS IN TERMS OF REGRESSION:OUTLIERS IN TERMS OF REGRESSION:

Page 42: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Points whose removal would greatly Points whose removal would greatly affect the association of two variablesaffect the association of two variables

Points whose removal would Points whose removal would significantly change the slope of an significantly change the slope of an LSR lineLSR line

Points with a large moment (i.e they Points with a large moment (i.e they are far away from the rest of the data.)are far away from the rest of the data.)

Usually outliers in the x direction.Usually outliers in the x direction.

INFLUENTIAL POINTS ARE:INFLUENTIAL POINTS ARE:

Page 43: Chapter 3: Scatterplots and Correlation AP STATISTICS.

The two graphs below show the same data – the one on The two graphs below show the same data – the one on the right with the removal of thethe right with the removal of the green data pointgreen data point. As. As you can see, the removal of thisyou can see, the removal of this point point significantly significantly affects the slope of the regression line. This is an affects the slope of the regression line. This is an influential point!influential point!

Page 44: Chapter 3: Scatterplots and Correlation AP STATISTICS.

An observation does NOT have An observation does NOT have

to be an Outlier to be an to be an Outlier to be an

Influential Point!! Influential Point!!

Nor does an observation need Nor does an observation need

to be an Influential Point in orderto be an Influential Point in order

to be an Outlier!!to be an Outlier!!

!!!REMEMBER!!!!!!REMEMBER!!!

Page 45: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Given the five-number summary Given the five-number summary {8 21 35 43 77}, which of the {8 21 35 43 77}, which of the following is correct?following is correct?

A. There are no outliersA. There are no outliers

B. There are at least two outliersB. There are at least two outliers

C. There is not enough data to make C. There is not enough data to make any conclusionany conclusion

D. There is exactly one outlierD. There is exactly one outlier

E. There is at least one outlierE. There is at least one outlier

Page 46: Chapter 3: Scatterplots and Correlation AP STATISTICS.

The correct answer is EThe correct answer is EThe five number summary gives youThe five number summary gives you

{Min Q{Min Q11 Median Q Median Q33 Max} Max}

The IQR is calculated by QThe IQR is calculated by Q33-Q-Q11

So, the IQR for the given data is 43-21=22So, the IQR for the given data is 43-21=22

An outlier for this data would be: An outlier for this data would be:

>Q>Q33+1.5*IQR or <Q+1.5*IQR or <Q11-1.5*IQR-1.5*IQR

>43+(22*1.5)=76 or <21-(22*1.5)=-12>43+(22*1.5)=76 or <21-(22*1.5)=-12

Since the max is 77, there must be Since the max is 77, there must be at least oneat least one outlieroutlier in this in this data set, but we cannot conclude how many outliers without data set, but we cannot conclude how many outliers without more data.more data.

Page 47: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Given the following scatterplot and residual plot. Which of the Given the following scatterplot and residual plot. Which of the following is true about the yellow data point?following is true about the yellow data point?

I. It is an influential pointI. It is an influential point

II. It is an outlier with respect to the regression modelII. It is an outlier with respect to the regression model

III. It appears to be an outlier in the x directionIII. It appears to be an outlier in the x direction

A. I onlyA. I only

B. I and IIB. I and II

C. I and IIIC. I and III

D. None of the aboveD. None of the above

E. All of the aboveE. All of the above

0 5 10 15

Page 48: Chapter 3: Scatterplots and Correlation AP STATISTICS.

The correct answer is cThe correct answer is cI.I. Because this point has a Because this point has a large momentlarge moment and is and is far from the rest of the data, it is an influential point. far from the rest of the data, it is an influential point. If this point was removed, the slope of the line would If this point was removed, the slope of the line would markedly change.markedly change.

II.II. This point is not an outlier with respect to the This point is not an outlier with respect to the model because as you can see in the residual plot, it model because as you can see in the residual plot, it does does not have a large residualnot have a large residual (It follows the (It follows the regression pattern of the data).regression pattern of the data).

III.III. By looking at both the scatterplot and the By looking at both the scatterplot and the residual plot, you can see that the yellow point is an residual plot, you can see that the yellow point is an outlier in the x directionoutlier in the x direction (far right of the rest of the (far right of the rest of the data). data).

Page 49: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Example: Salary and Experience•Salary vs. Years Experience

–For n = 6 employees

–Linear (straight line) relationship

–Increasing relationship

•higher salary generally goes with higher experience

–Correlation r = 0.8667

Experience

15

10

20

5

15

5

Salary

30

35

55

22

40

27

Mary earns $55,000

per year, and has

20 years of experience

20

30

40

50

60

0 10 20 ExperienceSala

ry (

$tho

usan

d)

Salary and Experience

Page 50: Chapter 3: Scatterplots and Correlation AP STATISTICS.

The Least-Squares Line Y=a+bX•Summarizes bivariate data: Predicts Y from X

–with smallest errors (in vertical direction, for Y axis)

–Intercept is 15.32 salary (at 0 years of experience)

–Slope is 1.673 salary (for each additional year of experience, on average)

10

20

30

40

50

60

0 10 20Experience (X)

Sala

ry (Y

)

Salary = 15.32 + 1.673 ExperienceSalary

Page 51: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Predicted Values and Residuals•Predicted Value comes from Least-Squares Line

–For example, Mary (with 20 years of experience)

has predicted salary 15.32+1.673(20) = 48.8

•So does anyone with 20 years of experience

•Residual is actual Y minus predicted Y

–Mary’s residual is 55 – 48.8 = 6.2

•She earns about $6,200 more than the predicted salary for a person with 20 years of experience

•A person who earns less than predicted will have a negative residual

( )y

Page 52: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Predicted and Residual (continued)

10

20

30

40

50

60

0 10 20Experience

Sala

ry

Mary’s predicted value is 48.8

Mary earns 55 thousand

Mary’s residual is 6.2Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries

2

11 2

n

nrSS Ye

52.6 26

168667.01686.11 2

eS

Page 53: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Making predictions: interpolationThe equation of the least-squares regression allows you to predict y for

any x within the range studied. This is called interpolating.

Nobody in the study drank 6.5

beers, but by finding the value

of from the regression line for

x = 6.5, we would expect a

blood alcohol content of 0.094

mg/ml.

mlmgy

y

/ 0944.00008.0936.0ˆ

0008.05.6*0144.0ˆ

Page 54: Chapter 3: Scatterplots and Correlation AP STATISTICS.

DeathsYear Powerboats Dead Manatees

1977 447 13

1978 460 21

1979 481 24

1980 498 16

1981 513 24

1982 512 20

1983 526 15

1984 559 34

1985 585 33

1986 614 33

1987 645 39

1988 675 43

1989 711 50

1990 719 47

There is a positive linear relationship between the number of powerboats registered and the number of manatee deaths. The least-squares regression line has for equation:

ˆ y 0.125x 41.4

Thus, if we were to limit the number of powerboat registrations to 500,000, what

could we expect for the number of manatee deaths? Roughly 21 manatees

1.214.415.62ˆ 4.41)500(125.0ˆ yy

Page 55: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Extrapolation

Extrapolation is the use of a

regression line for predictions outside

the range of x values used to obtain

the line.

This can be a very stupid thing to do,

as seen here.

Page 56: Chapter 3: Scatterplots and Correlation AP STATISTICS.

Do not use a regression on inappropriate data.

Pattern in the residuals

Presence of large outliers Use residual plots for

help.

Clumped data falsely appearing linear

Recognize when the correlation/regression is performed on

averages.

A relationship, however strong, does not itself imply causation.

Beware of lurking variables.

Avoid extrapolating (going beyond interpolation).

Caution with regression