Chapter 5: Summarizing Bivariate Data

Transcript of Chapter 5

Page 1: Chapter 5

Chapter 5

Summarizing Bivariate Data

Page 2: Chapter 5

Suppose we found the age and weight for each person in a sample of 10 adults. Is there any relationship between the age and weight of these adults?

Create a scatterplot of the data below.

Do you think there is a relationship? If so, what kind? If not, why not?

Age (x):    24  30  41  28  50  46  49  35  20  39
Weight (y): 256 124 320 185 158 129 103 196 110 130

[Scatterplot: Weight vs. Age]

There does not appear to be a relationship between age and weight in adults.
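As an aside (not part of the original slides), here is a minimal sketch of how this scatterplot could be drawn in Python, assuming the matplotlib library is available; the same few lines work for the height/weight data on the next slide.

    import matplotlib.pyplot as plt

    age = [24, 30, 41, 28, 50, 46, 49, 35, 20, 39]
    weight = [256, 124, 320, 185, 158, 129, 103, 196, 110, 130]

    plt.scatter(age, weight)      # one point per adult
    plt.xlabel("Age (years)")
    plt.ylabel("Weight")
    plt.title("Weight vs. Age for a sample of 10 adults")
    plt.show()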

Page 3: Chapter 5

Suppose we found the height and weight for each person in a sample of 10 adults. Is there any relationship between the height and weight of these adults?

Create a scatterplot of the data below.

Height (x): 74  65  77  72  68  60  62  73  61  64
Weight (y): 256 124 320 185 158 129 103 196 110 130

Do you think there is a relationship? If so, what kind? If not, why not?

Is it positive or negative? Weak or strong?

[Scatterplot: Weight vs. Height]

Page 4: Chapter 5

Correlation

• The relationship between bivariate numerical variables
– May be positive or negative
– May be weak or strong

What does it mean if the relationship is positive? Negative?

What feature(s) of the graph would indicate a weak or strong relationship?

Page 5: Chapter 5

Identify the strength and direction of the following data sets.

[Scatterplots: Set A, Set B, Set C, Set D]

Set A shows a strong, positive linear relationship.
Set B shows little or no relationship.
Set C shows a weaker (moderate), negative linear relationship.
Set D shows a strong, positive curved relationship.

Page 6: Chapter 5

Identify each as having a positive relationship, a negative relationship, or no relationship.

1. Heights of mothers and heights of their adult daughters (positive)
2. Age of a car in years and its current value (negative)
3. Weight of a person and calories consumed (positive)
4. Height of a person and the person's birth month (no relationship)
5. Number of hours spent in safety training and the number of accidents that occur (negative)

Page 7: Chapter 5

Correlation Coefficient (r)

• A quantitative assessment of the strength and direction of the linear relationship in bivariate, quantitative data
• Pearson's sample correlation coefficient is used the most
• Population correlation coefficient: ρ (rho); sample statistic: r
• Equation:

$r = \dfrac{1}{n-1}\sum \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$

What are the values $\frac{x_i - \bar{x}}{s_x}$ and $\frac{y_i - \bar{y}}{s_y}$ called? These are the z-scores for x and y.

Page 8: Chapter 5

Example 5.1

For the six primarily undergraduate universities in California with enrollments between 10,000 and 20,000, six-year graduation rates (y) and student-related expenditures per full-time students (x) for 2003 were reported as follows:

Create a scatterplot and calculate r.

Expenditures (x):     8011  7323  8735  7548  7071  8248
Graduation rate (y):  64.6  53.0  46.3  42.5  38.5  33.9
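As a hedged illustration (not from the original slides), the correlation coefficient for these six universities could be computed in Python roughly as follows, assuming NumPy is available; it follows the z-score form of Pearson's r given earlier.

    import numpy as np

    expenditures = np.array([8011, 7323, 8735, 7548, 7071, 8248], dtype=float)
    grad_rates = np.array([64.6, 53.0, 46.3, 42.5, 38.5, 33.9])

    # z-scores, using the sample standard deviation (ddof=1)
    zx = (expenditures - expenditures.mean()) / expenditures.std(ddof=1)
    zy = (grad_rates - grad_rates.mean()) / grad_rates.std(ddof=1)

    r = (zx * zy).sum() / (len(zx) - 1)
    print(round(r, 2))  # about 0.05, matching the slide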

Page 9: Chapter 5

Example 5.1 Continued

Expenditures (x):     8011  7323  8735  7548  7071  8248
Graduation rate (y):  64.6  53.0  46.3  42.5  38.5  33.9

[Scatterplot: Graduation Rates vs. Expenditures]

r = 0.05

In order to interpret what this number tells us, let's investigate the properties of the correlation coefficient.

Page 10: Chapter 5

Properties of r (the correlation coefficient)

1) Legitimate values are -1 ≤ r ≤ 1.

[Number line marked at -1, -.8, -.5, 0, .5, .8, 1: values near 0 indicate no or weak correlation, values between about .5 and .8 in magnitude indicate moderate correlation, and values beyond about .8 in magnitude indicate strong correlation.]

Page 11: Chapter 5

2) The value of r is not changed by any linear transformation.

Suppose that the graduation rates were changed from percents to decimals (divide by 100). Transform the graduation rates and calculate r.

Do the following transformations and calculate r:
1) x' = 5(x + 14)
2) y' = (y + 30) ÷ 4

Expenditures (x):     8011  7323  8735  7548  7071  8248
Graduation rate (y):  64.6  53.0  46.3  42.5  38.5  33.9

r = 0.05. It is the same! Why?
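A small sketch (my addition, assuming Python with NumPy) that checks this property numerically; the helper name pearson_r is just an illustrative choice.

    import numpy as np

    def pearson_r(x, y):
        """Sample correlation: r = (1/(n-1)) * sum of z_x * z_y."""
        zx = (x - x.mean()) / x.std(ddof=1)
        zy = (y - y.mean()) / y.std(ddof=1)
        return (zx * zy).sum() / (len(x) - 1)

    x = np.array([8011, 7323, 8735, 7548, 7071, 8248], dtype=float)
    y = np.array([64.6, 53.0, 46.3, 42.5, 38.5, 33.9])

    print(round(pearson_r(x, y), 3))                        # about 0.050
    print(round(pearson_r(5 * (x + 14), (y + 30) / 4), 3))  # same value: positive-slope linear transformations leave r unchanged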

Page 12: Chapter 5

3) The value of r does not depend on which of the two variables is labeled x.

Suppose we wanted to estimate the expenditures per student for given graduation rates. Switch x and y, then calculate r.

Expenditures:      8011  7323  8735  7548  7071  8248
Graduation rates:  64.6  53.0  46.3  42.5  38.5  33.9

r = 0.05. It is the same!

Page 13: Chapter 5

4) The value of r is affected by extreme values.

Expenditures:      8011  7323  8735  7548  7071  8248
Graduation rates:  64.6  53.0  46.3  42.5  38.5  33.9

Suppose the 33.9 was REALLY 63.9. What do you think would happen to the value of the correlation coefficient?

Plot a revised scatterplot and find r.

[Scatterplots: Graduation Rates vs. Expenditures, with 33.9 and with 63.9]

r = 0.42

Extreme values affect the correlation coefficient.

Page 14: Chapter 5

5) The value of r is a measure of the extent to which x and y are linearly related.

Find the correlation for these points:

x: -3  -1   1   3   5   7   9
y: 40  20   8   4   8  20  40

Sketch the scatterplot and compute the correlation coefficient.

r = 0

Does this mean that there is NO relationship between these points?

r = 0, but the data set has a definite relationship!
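A brief check of this example (my addition, assuming Python with NumPy): the computed r is essentially zero even though the points lie exactly on a parabola.

    import numpy as np

    x = np.array([-3, -1, 1, 3, 5, 7, 9], dtype=float)
    y = np.array([40, 20, 8, 4, 8, 20, 40], dtype=float)

    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    r = (zx * zy).sum() / (len(x) - 1)

    print(round(r, 6))  # 0.0, yet y = (x - 3)**2 + 4 fits every point exactly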

Page 15: Chapter 5

Recap the Properties of r:

1. Legitimate values of r are -1 ≤ r ≤ 1.
2. The value of r is not changed by any linear transformation.
3. The value of r does not depend on which of the two variables is labeled x.
4. The value of r is affected by extreme values.
5. The value of r is a measure of the extent to which x and y are linearly related.

Page 16: Chapter 5

Example 5.1 Continued

Expenditures (x):     8011  7323  8735  7548  7071  8248
Graduation rate (y):  64.6  53.0  46.3  42.5  38.5  33.9

[Scatterplot: Graduation Rates vs. Expenditures]

Interpret r = 0.05.

In order to interpret r, recall the definition of the correlation coefficient: a quantitative assessment of the strength and direction of the linear relationship between bivariate, quantitative data.

There is a weak, positive, linear relationship between expenditures and graduation rates.

Page 17: Chapter 5

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable? Consider the following examples:

• The relationship between the number of cavities in a child's teeth and the size of his or her vocabulary is strong and positive. (So does this mean I should feed children more candy to increase their vocabulary? No. Both variables are strongly related to the age of the child.)

• Consumption of hot chocolate is negatively correlated with crime rate. (Should we all drink more hot chocolate to lower the crime rate? No. Both are responses to cold weather.)

Causality can only be shown by carefully controlling the values of all variables that might be related to the ones under study; in other words, with a well-controlled, well-designed experiment.

Page 18: Chapter 5

Correlation does not imply causation


Page 19: Chapter 5

What is the objective of regression analysis?

• x-variable: the independent or explanatory variable
• y-variable: the dependent or response variable
• We will use values of x to predict values of y.

Suppose that we have two variables:
x = the amount spent on advertising
y = the amount of sales for the product during a given period

What question might I want to answer using this data?

The objective of regression analysis is to use information about one variable, x, to draw some sort of conclusion about a second variable, y.

Page 20: Chapter 5

Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the relationship between the variables by finding a line that is as close as possible to the points in the plot. This is done by calculating the line of best fit, or Least Squares Regression Line (LSRL).

The LSRL is the line that minimizes the sum of the squares of the deviations from the line.

The LSRL is $\hat{y} = a + bx$. ($\hat{y}$, "y-hat," means the predicted y. Be sure to put the hat on the y.)

b is the slope:
– it is the approximate amount by which y increases when x increases by 1 unit
– the slope of the LSRL is $b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

a is the y-intercept:
– it is the approximate height of the line when x = 0
– in some situations, the y-intercept has no meaning
– the intercept of the LSRL is $a = \bar{y} - b\bar{x}$

Let's explore what this means . . .

Page 21: Chapter 5

Suppose we have a data set that consists of the observations (0,0), (3,10), and (6,2).

Let's just fit a line to the data by drawing a line through what appears to be the middle of the points, say $\hat{y} = 0.5x + 4$.

Now find the vertical distance from each point to the line:

(0,0):  $\hat{y} = 0.5(0) + 4 = 4$, so the deviation is 0 - 4 = -4
(3,10): $\hat{y} = 0.5(3) + 4 = 5.5$, so the deviation is 10 - 5.5 = 4.5
(6,2):  $\hat{y} = 0.5(6) + 4 = 7$, so the deviation is 2 - 7 = -5

Find the sum of the squares of these deviations:

Sum of the squares = $(-4)^2 + (4.5)^2 + (-5)^2 = 61.25$

Page 22: Chapter 5

Use a calculator to find the line of best fit for (0,0), (3,10), and (6,2):

$\hat{y} = \frac{1}{3}x + 3$

Find the vertical deviations from the line: -3, 6, -3.

What is the sum of the deviations from the line? Will it always be zero?

Find the sum of the squares of the deviations from the line:

Sum of the squares = $(-3)^2 + (6)^2 + (-3)^2 = 54$

The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
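A minimal sketch (not part of the original slides, assuming Python with NumPy) that applies the slope and intercept formulas to these three points and compares the sum of squared deviations for the LSRL with the eyeballed line from the previous slide.

    import numpy as np

    x = np.array([0.0, 3.0, 6.0])
    y = np.array([0.0, 10.0, 2.0])

    # Slope and intercept from the LSRL formulas
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()
    print(round(a, 4), round(b, 4))   # 3.0 and 0.3333, so y-hat = (1/3)x + 3

    # Sum of squared deviations: LSRL vs. the eyeballed line 0.5x + 4
    print(round(((y - (a + b * x)) ** 2).sum(), 2))    # 54.0
    print(round(((y - (0.5 * x + 4)) ** 2).sum(), 2))  # 61.25, larger, as expected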

Page 23: Chapter 5

Researchers are studying pomegranate's antioxidant properties to see if it might be helpful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with .1% pomegranate fruit extract (PFE), and water supplemented with .2% PFE. The average tumor volume for mice in each group was recorded at several points in time. (x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume in mm3)

x 11 15 19 23 27

y 150 270 450 580 740

Sketch a scatterplot for this data set.

Page 24: Chapter 5

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Calculate the LSRL and the correlation coefficient.

Interpret the slope and the correlation coefficient in context.

$\hat{y} = -269.75 + 37.25x$, with $r = 0.998$

Remember that an interpretation is stating the definition in context.

The average volume of the tumor increases by approximately 37.25 mm3 for each one-day increase in the number of days after injection.

There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.

Does the intercept have meaning in this context? Why or why not?

Page 25: Chapter 5

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Predict the average volume of the tumor for 20 days after injection.
Predict the average volume of the tumor for 5 days after injection.

$\hat{y} = -269.75 + 37.25x$

$\hat{y} = -269.75 + 37.25(20) = 475.25$ mm3

$\hat{y} = -269.75 + 37.25(5) = -83.5$ mm3

Can volume be negative?

This is the danger of extrapolation. The least-squares line should not be used to make predictions for y using x-values outside the range in the data set. Why? It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.

Page 26: Chapter 5

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Suppose we want to know how many days after injection of cancer cells the average tumor size would be 500 mm3.

$\hat{y} = -269.75 + 37.25x$

Is this the appropriate regression line to answer this question?

No. The slope of the line for predicting x from y is $r\dfrac{s_x}{s_y}$, which is not the reciprocal of the slope $r\dfrac{s_y}{s_x}$ for predicting y from x, and the intercepts are almost always different.

Here is the appropriate regression line: $\hat{x} = 7.277 + 0.027y$

The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.

Page 27: Chapter 5

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Find the mean of the x-values ($\bar{x}$) and the mean of the y-values ($\bar{y}$).

Plot the point of averages ($\bar{x}$, $\bar{y}$) on the scatterplot.

$\bar{x} = 19$ and $\bar{y} = 438$

Will the point of averages always be on the regression line?

Page 28: Chapter 5

Let's investigate how the LSRL and correlation coefficient change when different points are added to the data set. Suppose we have the following data set.

x: 4 5 6 7 8
y: 2 5 4 6 9

Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.

$\hat{y} = -3.8 + 1.5x$,  $r = 0.916$

Page 29: Chapter 5

Suppose we have the same data set.

x: 4 5 6 7 8
y: 2 5 4 6 9

$\hat{y} = -3.8 + 1.5x$,  $r = 0.916$

Suppose we add the point (5,8) to the data set. What happens to the regression line and the correlation coefficient?

$\hat{y} = -1.15 + 1.17x$,  $r = 0.667$

What happened?

Page 30: Chapter 5

Suppose we have the original data set again.

x: 4 5 6 7 8
y: 2 5 4 6 9

$\hat{y} = -3.8 + 1.5x$,  $r = 0.916$

Suppose we add the point (12,12) to the data set. What happens to the regression line and the correlation coefficient?

$\hat{y} = -2.24 + 1.225x$,  $r = 0.959$

What happened?

Page 31: Chapter 5

Suppose we have the original data set again.

x: 4 5 6 7 8
y: 2 5 4 6 9

$\hat{y} = -3.8 + 1.5x$,  $r = 0.916$

Suppose we add the point (12,0) to the data set. What happens to the regression line and the correlation coefficient?

$\hat{y} = 6.26 - 0.275x$,  $r = -0.248$

What happened?

Page 32: Chapter 5

The correlation coefficient and the LSRL are both measures that are affected by extreme values.

Page 33: Chapter 5

Pomegranate study revisited

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Minitab, a statistical software package, was used to fit the least-squares regression line. Part of the resulting output is shown below. The regression equation is

Predicted volume = -269.75 + 37.25 days

Predictor   Coef       SE Coef     T           P
Constant    -269.75    23.421412   -11.51724   0.0014
Days        37.25      1.181454    31.52895    0.000

The Coef column gives the intercept (Constant) and the slope (Days). We will discuss what the other numbers mean in Chapter 13.

Page 34: Chapter 5

Assessing the fit of the LSRL

Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y. Important questions are:

1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
3. If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?

We will look at graphical and numerical methods to answer these questions.

Page 35: Chapter 5

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x: 6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y: 0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

Minitab was used to fit the least-squares regression line. From the partial output, identify the regression line.

Predictor            Coef    SE Coef   T       P
Constant             -7.69   13.33     -0.58   0.582
Distance to debris   3.234   1.782     1.82    0.112

S = 8.67071   R-Sq = 32.0%   R-Sq(adj) = 22.3%

$\hat{y} = -7.69 + 3.234x$

Plot the data, including the regression line.

Page 36: Chapter 5

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x: 6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y: 0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

[Scatterplot: Distance traveled vs. Distance to debris, with the fitted LSRL]

The vertical deviation between a point and the LSRL is called the residual.

If the point is above the line, the residual will be positive. If the point is below the line, the residual will be negative.

Residuals are calculated by subtracting the predicted y from the observed y:

residual = $y - \hat{y}$

Page 37: Chapter 5

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Use the LSRL to calculate the predicted distance traveled. Subtract to find the residuals.

Distance from debris (x)   Distance traveled (y)   Predicted distance traveled (ŷ)   Residual (y - ŷ)
6.94                        0.00                   14.76                             -14.76
5.23                        6.13                    9.23                              -3.10
5.21                       11.29                    9.16                               2.13
7.10                       14.35                   15.28                              -0.93
8.16                       12.03                   18.70                              -6.67
5.50                       22.72                   10.10                              12.62
9.19                       20.11                   22.04                              -1.93
9.05                       26.16                   21.58                               4.58
9.36                       30.65                   22.59                               8.06

What does the sum of the residuals equal? Will the sum of the residuals always equal zero? What does this remind you of?
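As an illustration only (assuming Python with NumPy), the predicted values and residuals in the table could be reproduced like this; the coefficients come from the Minitab output above.

    import numpy as np

    x = np.array([6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36])
    y = np.array([0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65])

    a, b = -7.69, 3.234           # intercept and slope from the Minitab output
    y_hat = a + b * x             # predicted distance traveled
    residuals = y - y_hat         # observed minus predicted

    print(np.round(y_hat, 2))
    print(np.round(residuals, 2))
    print(round(residuals.sum(), 2))  # near zero; not exactly zero here, since a and b are rounded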

Page 38: Chapter 5

Residual Plots

• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can also be graphed against the predicted y-values.
• The purpose is to determine if a linear model is the best way to describe the relationship between the x and y variables.
• If no pattern exists between the points in the residual plot, then the linear model is appropriate.

Page 39: Chapter 5

[Two residual plots: residuals vs. x]

The first residual plot shows no pattern, so it indicates that the linear model is appropriate.

The second residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.

Page 40: Chapter 5

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Distance from debris (x)   Distance traveled (y)   Predicted distance traveled (ŷ)   Residual (y - ŷ)
6.94                        0.00                   14.76                             -14.76
5.23                        6.13                    9.23                              -3.10
5.21                       11.29                    9.16                               2.13
7.10                       14.35                   15.28                              -0.93
8.16                       12.03                   18.70                              -6.67
5.50                       22.72                   10.10                              12.62
9.19                       20.11                   22.04                              -1.93
9.05                       26.16                   21.58                               4.58
9.36                       30.65                   22.59                               8.06

Use the values in this table to create a residual plot for this data set. Plot the residuals against the distance from debris (x). Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?

Page 41: Chapter 5

Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food.

[Residual plot: residuals vs. distance from debris]

Now plot the residuals against the predicted distance traveled.

Page 42: Chapter 5

[Residual plot: residuals vs. predicted distance traveled, shown alongside the earlier plot of residuals vs. distance from debris]

What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values?

Residual plots can be plotted against either the x-values or the predicted y-values.

Page 43: Chapter 5

Let's examine the following data set: The following data is for 12 black bears from the Boreal Forest.

x = age (in years) and y = weight (in kg)

x: 10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y: 54    40   62    51    55   56   62   42   40   59    51   50

Sketch a scatterplot with the fitted regression line.

[Scatterplots: Weight vs. Age, with and without the 28.5-year-old bear]

Do you notice anything unusual about this data set? The 28.5-year-old bear lies far from the other x-values.

Influential observation: what would happen to the regression line if this point is removed? This point is considered an influential point because it affects the placement of the least-squares regression line.

Page 44: Chapter 5

Let's examine the following data set: The following data is for 12 black bears from the Boreal Forest.

x = age (in years) and y = weight (in kg)

x: 10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y: 54    40   62    51    55   56   62   42   40   59    51   50

[Scatterplot: Weight vs. Age, with the fitted regression line]

Notice that one observation has a large residual. An observation is an outlier if it has a large residual.

Page 45: Chapter 5

Coefficient of Determination

• Denoted by r²
• Gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y

Page 46: Chapter 5

Let's explore the meaning of r² by revisiting the deer mouse data set.

x = the distance from the food to the nearest pile of fine woody debris
y = the distance a deer mouse will travel for food

x: 6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y: 0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

Suppose you didn't know any x-values. What distance would you expect deer mice to travel? Your best guess would be the mean, $\bar{y} = 15.938$.

What is the total amount of variation in the distance traveled (y-values)? Hint: find the sum of the squared deviations.

$SSTo = \sum (y - \bar{y})^2$

SS stands for "sum of squares," so this is the total sum of squares.

The total amount of variation in the distance traveled is 773.95 m². Why do we square the deviations?

[Scatterplot: Distance traveled vs. Distance to debris, with a horizontal line at $\bar{y}$]

Page 47: Chapter 5

Now suppose you DO know the x-values. Your best guess for each y would be the predicted distance traveled (the point on the LSRL).

x = the distance from the food to the nearest pile of fine woody debris
y = the distance a deer mouse will travel for food

x: 6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y: 0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

By how much do the observed points vary from the LSRL? Hint: find the sum of the squared residuals.

$SSResid = \sum (y - \hat{y})^2$

The points vary from the LSRL by 526.27 m².

[Scatterplot: Distance traveled vs. Distance to debris, with the fitted LSRL]

Page 48: Chapter 5

x = the distance from the food to the nearest pile of fine woody debris
y = the distance a deer mouse will travel for food

x: 6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y: 0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

The points vary from the LSRL by 526.27 m². The total amount of variation in the distance traveled is 773.95 m².

Approximately what percent of the variation in distance traveled can be explained by the regression line?

$r^2 = 1 - \dfrac{SSResid}{SSTo} = 1 - \dfrac{526.27}{773.95} = 0.320$

Or approximately 32%.
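A hedged sketch (my addition, assuming Python with NumPy) that reproduces SSTo, SSResid, r², and, looking ahead to the next slide, the standard deviation about the line:

    import numpy as np

    x = np.array([6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36])
    y = np.array([0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65])

    # Fit the LSRL from the formulas rather than the rounded Minitab output
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()

    ss_to = ((y - y.mean()) ** 2).sum()          # about 773.95
    ss_resid = ((y - (a + b * x)) ** 2).sum()    # about 526.3
    r_sq = 1 - ss_resid / ss_to                  # about 0.32

    s_e = np.sqrt(ss_resid / (len(y) - 2))       # about 8.67, the "S" in the Minitab output
    print(round(ss_to, 2), round(ss_resid, 2), round(r_sq, 3), round(s_e, 2))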

Page 49: Chapter 5

Partial output from the regression analysis of the deer mouse data:

Predictor            Coef    SE Coef   T       P
Constant             -7.69   13.33     -0.58   0.582
Distance to debris   3.234   1.782     1.82    0.112

S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

Let's review the values from this output and their meanings.

The y-intercept (a): This value has no meaning in context, since it doesn't make sense to have a negative distance.

The slope (b): The distance traveled for food increases by approximately 3.234 meters for each 1-meter increase in the distance to the nearest debris pile.

The coefficient of determination (r²): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.

The standard deviation about the line (s_e, reported as S): This is the typical amount by which an observation deviates from the least-squares regression line. It is found by

$s_e = \sqrt{\dfrac{SSResid}{n-2}}$

Page 50: Chapter 5

Let's examine this data set:
x = representative age
y = average marathon finish time

Create a scatterplot for this data set.

Age:  15      25      35      45      55      65
Time: 302.38  193.63  185.46  198.49  224.30  288.71

[Scatterplot: Average Finish Time vs. Representative Age]

Because of the curved pattern, a straight line would not accurately describe the relationship between average finish time and age.

Since this curve resembles a parabola, a quadratic function can be used to describe this relationship:

$\hat{y} = a + b_1 x + b_2 x^2$

Using Minitab, the least-squares quadratic regression is

$\hat{y} = 462 - 14.2x + 0.179x^2$

This curve minimizes the sum of the squares of the residuals (similar to least-squares linear regression).
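As a rough illustration (not from the original slides, assuming Python with NumPy), the quadratic fit and its R² could be reproduced with polyfit; the printed values should land near the ones quoted here and on the following slides.

    import numpy as np

    age = np.array([15, 25, 35, 45, 55, 65], dtype=float)
    time = np.array([302.38, 193.63, 185.46, 198.49, 224.30, 288.71])

    # Least-squares quadratic fit; polyfit returns coefficients from the highest power down
    b2, b1, a = np.polyfit(age, time, deg=2)
    print(round(a), round(b1, 1), round(b2, 3))   # roughly 462, -14.2, 0.179

    # R-squared for the quadratic fit
    fitted = a + b1 * age + b2 * age ** 2
    ss_resid = ((time - fitted) ** 2).sum()
    ss_to = ((time - time.mean()) ** 2).sum()
    print(round(1 - ss_resid / ss_to, 3))         # roughly 0.921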

Page 51: Chapter 5

Let's examine this data set:
x = representative age
y = average marathon finish time

Age:  15      25      35      45      55      65
Time: 302.38  193.63  185.46  198.49  224.30  288.71

Notice the residuals from the quadratic regression.

[Residual plot: residuals vs. age for the quadratic fit]

Since there is no pattern in the residual plot, the quadratic regression is an appropriate model for this data set.

Page 52: Chapter 5

Let's examine this data set:
x = representative age
y = average marathon finish time

Age:  15      25      35      45      55      65
Time: 302.38  193.63  185.46  198.49  224.30  288.71

The measure R² is useful for assessing the fit of the quadratic regression:

$R^2 = 1 - \dfrac{SSResid}{SSTo}$

R² = .921

92.1% of the variation in average marathon finish times can be explained by the approximate quadratic relationship between average finish time and age.

Page 53: Chapter 5

Depending on the data set, other regression models, such as cubic regression, may be used. Statistical software (like Minitab) is commonly used to calculate these regression models.

Another method for fitting regression models to non-linear data sets is to transform the data, making it linear. Then a least-squares regression line can be fit to the transformed data.

Page 54: Chapter 5

Commonly Used Transformations

Transformation                               Equation
No transformation                            ŷ = a + bx
Square root of x                             ŷ = a + b√x
Log of x *                                   ŷ = a + b log₁₀(x)
Reciprocal of x                              ŷ = a + b(1/x)
Log of y * (exponential growth or decay)     log₁₀(ŷ) = a + bx

* Natural log may also be used

Page 55: Chapter 5

Pomegranate study revisited:

x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

Sketch a scatterplot for this data set.

x: 11  15  19  23  27   31   35   39
y: 40  75  90  210 230  330  450  600

[Scatterplot: Average tumor volume vs. Number of days]

There appears to be a curve in the data points. Let's use a transformation to linearize the data. Since the data appear to show exponential growth, let's try the "log of y" transformation.

Page 56: Chapter 5

Pomegranate study revisited:

x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

Sketch a scatterplot of log(y) versus x.

x:      11    15    19    23    27    31    35    39
log(y): 1.60  1.88  1.95  2.32  2.36  2.52  2.65  2.78

[Scatterplot: Log of average tumor volume vs. Number of days]

Notice that the relationship now appears linear. Let's fit an LSRL to the transformed data.

The LSRL is $\log(\hat{y}) = 1.226 + 0.041x$

Page 57: Chapter 5

Pomegranate study revisited:

x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

x:      11    15    19    23    27    31    35    39
log(y): 1.60  1.88  1.95  2.32  2.36  2.52  2.65  2.78

The LSRL is $\log(\hat{y}) = 1.226 + 0.041x$

What would the predicted average tumor size be 30 days after injection of cancer cells?

$\log(\hat{y}) = 1.226 + 0.041(30) = 2.456$

$\hat{y} = 10^{2.456} = 285.76$ mm3
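A small sketch (my addition, assuming Python with NumPy) of the log-of-y transformation, the LSRL on the transformed data, and the back-transformed prediction; small differences from the slide values come from rounding.

    import numpy as np

    days = np.array([11, 15, 19, 23, 27, 31, 35, 39], dtype=float)
    volume = np.array([40, 75, 90, 210, 230, 330, 450, 600], dtype=float)

    log_vol = np.log10(volume)                # the "log of y" transformation linearizes the pattern

    b, a = np.polyfit(days, log_vol, deg=1)   # slope and intercept of the LSRL on (x, log y)
    print(round(a, 3), round(b, 3))           # close to the slide's 1.226 and 0.041

    # Back-transform to predict the average tumor volume 30 days after injection,
    # using the rounded coefficients reported on the slide
    log_pred = 1.226 + 0.041 * 30
    print(round(10 ** log_pred, 2))           # 285.76 (in mm^3)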

Page 58: Chapter 5

Another useful transformation is the power transformation. The power transformation ladder and the scatterplot (both below) can be used to help determine what type of transformation is appropriate.

Power Transformation Ladder

Power   Transformed Value       Name
3       (Original value)³       Cube
2       (Original value)²       Square
1       (Original value)        No transformation
1/2     √(Original value)       Square root
1/3     ∛(Original value)       Cube root
0       Log(Original value)     Logarithm
-1      1/(Original value)      Reciprocal

[Scatterplot showing curves labeled 1 and 2]

Suppose that the scatterplot looks like the curve labeled 1. Then we would use a power that is up the ladder from the no-transformation row for both the x and y variables.

Suppose that the scatterplot looks like the curve labeled 2. Then we would use a power that is up the ladder from the no-transformation row for the x variable and a power down the ladder for the y variable.

Page 59: Chapter 5

Logistic Regression (Optional)

• Can be used if the dependent variable is categorical with just two possible values
• Used to describe how the probability of "success" changes as a numerical predictor variable, x, changes
• With p denoting the probability of success, the logistic regression equation is

$p = \dfrac{e^{a+bx}}{1 + e^{a+bx}}$

where a and b are constants.

For any value of x, the value of p is always between 0 and 1. The graph of this equation has an "S" shape.

Page 60: Chapter 5

In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf spider's decision to kill and consume her partner during courtship or mating. Data were collected for 53 pairs of courting wolf spiders. (Data listed on page 287.)

x = the difference in body width (female – male)
y = cannibalism, coded 0 for no cannibalism and 1 for cannibalism

Minitab was used to construct a scatterplot and to fit a logistic regression to the data.

$p = \dfrac{e^{-3.08904 + 3.06928x}}{1 + e^{-3.08904 + 3.06928x}}$

Note that the plot was constructed so that if two points fell in the exact same location they would be offset a little bit so that all points would be visible (called jittering).

This equation can be used to predict the probability of the male spider being cannibalized based on the difference in size.

What is the probability of cannibalism if the male and female spiders are the same width (difference of 0)?

$p = \dfrac{e^{-3.08904 + 3.06928(0)}}{1 + e^{-3.08904 + 3.06928(0)}} = 0.044$
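As an illustration only (assuming Python with NumPy; the helper name prob_cannibalism is hypothetical), the fitted logistic equation can be evaluated like this:

    import numpy as np

    a, b = -3.08904, 3.06928   # intercept and slope of the fitted logistic regression

    def prob_cannibalism(size_diff):
        """Probability of cannibalism for a given female-minus-male body-width difference."""
        z = a + b * size_diff
        return np.exp(z) / (1 + np.exp(z))

    print(round(prob_cannibalism(0.0), 3))   # about 0.044: same-width pairs rarely end in cannibalism
    print(round(prob_cannibalism(1.0), 3))   # about 0.495: the probability rises quickly with the size difference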