

©2014 Cengage Learning. All Rights Reserved. May not be scanned, copied, or duplicated, or posted to a publicly accessible website, in whole or in part.

AP* SOLUTIONS

Chapter 4 Describing Bivariate Numerical Data

Section 4.1 Exercise Set 1

4.1: Scatterplot 1: (i) Yes, (ii) Yes, (iii) Negative

Scatterplot 2: (i) Yes, (ii) No, (iii) –

Scatterplot 3: (i) Yes, (ii) Yes, (iii) Positive


4.2: (a) Negative correlation, because as interest rates rise, the number of loan applications might decrease. (b) Close to zero, because there is no reason to believe that height and IQ should be related. (c) Positive correlation, because taller people tend to have larger feet. (d) Positive correlation, because as the minimum daily temperature increases, the cooling cost would also increase.

4.3: (a) There is a moderately strong positive association between school achievement test score and midlife IQ. (b) r = 0.6, because the article says that r = 0.64 indicated a very strong relationship, higher than the correlation between height and weight in adults. Therefore, a correlation that is moderately strong (r = 0.6) with a positive association (taller people tend to weigh more) is consistent with the statement.

4.4: (a) r = 0.335. The correlation indicates that there is a weak, positive linear relationship between number of acres burned and timber sales. (b) The conclusion that “heavier logging led to large forest fires” cannot be justified because correlation does not imply causation; the correlation only measures the strength and direction of the association between two quantitative variables.

*AP and Advanced Placement Program are registered trademarks of the College Entrance Examination Board, which was not involved in the production of, and does not endorse, this product.



4.5: (a)

[Scatterplot: Satisfaction Rating (vertical axis) versus Quality Rating (horizontal axis)]

There is a weak, negative linear relationship between satisfaction rating and quality rating.

(b) The correlation of r = -0.239 indicates that there is a weak, negative linear relationship between satisfaction rating and quality rating.

4.6: No, the statement is not correct. A correlation of 0 indicates that there is not a linear relationship between two variables. There could be a strong non-linear relationship (for example, a quadratic relationship) between the two variables.
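The point can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (the x values and the helper name `pearson_r` are illustrative, not taken from the exercise) in which y is completely determined by x through a quadratic relationship, yet the correlation coefficient is 0.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y = x^2, a perfect but nonlinear relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Because the x values are symmetric about 0, the positive and negative cross-products cancel, so r measures no linear trend even though the relationship is exact.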

4.7: No, it is not reasonable to conclude that increasing alcohol consumption will increase income. Correlation measures the strength of association, but association does not imply causation. It could be that people with higher incomes can afford to drink more alcohol.

4.8: The correlation between college GPA and academic self-worth (r = 0.48) indicates that there is a weak or moderate positive linear relationship between those variables. This tells us that those athletes with a higher GPA tend to feel better about themselves academically than those with lower grades. The correlation between college GPA and high school GPA (r = 0.46) indicates that there is a weak or moderate positive relationship between those variables as well. This tells us that those athletes with a higher high school GPA tend to also have a higher college GPA. Finally, the correlation between college GPA and a measure of tendency to procrastinate (r = -0.36) indicates that there is a weak negative linear relationship between those variables. Those athletes with a lower college GPA tend to procrastinate more than those athletes with a higher college GPA.




4.9: Scatterplot 1: (i) Yes, (ii) Yes, (iii) Positive

Scatterplot 2: (i) Yes, (ii) Yes, (iii) Negative


Scatterplot 4: (i) Yes, (ii) Yes, (iii) Negative

4.10: (a) Negative correlation, because larger cars tend to get lower gas mileage.

(b) Positive correlation, because larger houses tend to be more expensive.

(c) Positive correlation, because taller people tend to be heavier.

(d) Correlation close to zero, because there is no reason to believe that height and number of siblings are related to each other.

4.11: No; this value of r indicates a weak linear relationship, because this value of r lies between -0.5 and 0.5.

4.12: Graph 1: r = 1:

[Scatterplot of y versus x in which all points fall exactly on a line with positive slope]


Graph 2: r = –1:

[Scatterplot of y versus x in which all points fall exactly on a line with negative slope]

4.13: (a) r = 0.118

(b) Yes, it is reasonable to conclude that there is no strong relationship, linear or otherwise, between household debt and consumer debt. The correlation coefficient of r = 0.118 indicates that there is no strong linear relationship, and the scatterplot below confirms this observation.

[Scatterplot: Consumer Debt (vertical axis) versus Household Debt (horizontal axis)]


4.14: (a)

[Scatterplot: Teen Birth Rate (vertical axis) versus Religiosity Score (horizontal axis)]

There is a moderately strong, positive linear relationship between teen birth rate and religiosity score. The scatterplot supports the statement that “teen birth rate is very highly correlated with religiosity at the state level, with more religious states having a higher rate of teen birth” because states with higher religiosity scores tend to have higher teen birth rates.

(b) The correlation of r = 0.730 indicates that there is a moderately strong, linear relationship with a positive association between teen birth rate and religiosity score.



Additional Exercises for Section 4.1

4.15: (a)

[Scatterplot: Foal Weight (kg) (vertical axis) versus Mare Weight (kg) (horizontal axis)]

(b) r = 0.001 (c) The scatterplot shows that there is no obvious relationship (linear or otherwise) between mare weight and foal weight. The correlation coefficient of r = 0.001 indicates that there is essentially no linear relationship between mare weight and foal weight.

4.16: The correlation coefficient between household debt and corporate debt would be positive and large, indicating a strong association between these variables. This is apparent from the time series plots because the plots follow the same general trend. Over time, both household debt and corporate debt generally follow an increasing trend (except for the change from 2001-2002, where corporate debt decreases, and household debt nearly levels off).

4.17: r = 0.987; this value is consistent with the previous answer, because the correlation coefficient is large and positive (close to 1), which indicates a strong positive association between household debt and corporate debt.

4.18: No, the correlation would not be an appropriate way to summarize the relationship between artist’s name and sale price of each painting because the correlation describes the strength of the relationship between two quantitative variables. In this case, artist’s name is categorical, not quantitative.

4.19: The sample correlation coefficient would be closest to -0.9. Cars traveling at a faster rate of speed will travel the length of the highway segment more quickly than those who are traveling more slowly, and we can expect this relationship to be strong.

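4.20:

$$r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}} = \frac{281.1 - \dfrac{(88.8)(86.1)}{39}}{\sqrt{288.0 - \dfrac{(88.8)^2}{39}}\,\sqrt{286.6 - \dfrac{(86.1)^2}{39}}} = 0.935$$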

There is a strong positive linear relationship between pigment concentration in the right eye and pigment concentration in the left eye.
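The arithmetic behind r = 0.935 can be reproduced with a short sketch. It assumes the summary quantities read from the solution (n = 39, Σx = 88.8, Σy = 86.1, Σxy = 281.1, Σx² = 288.0, Σy² = 286.6) and the usual computational formula for the correlation coefficient; the variable names are illustrative.

```python
import math

# Summary quantities as they appear in the solution to Exercise 4.20
n = 39
sum_x, sum_y = 88.8, 86.1
sum_xy = 281.1
sum_x2, sum_y2 = 288.0, 286.6

# Corrected sums of products and squares
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))  # prints 0.935
```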


4.21: It makes sense to use the least squares regression line to summarize the relationship between x and y for Scatterplot 1, but not for Scatterplot 2. Scatterplot 1 shows a linear relationship between x and y, but Scatterplot 2 shows a curved relationship between x and y.

4.22: The sum of the squared vertical deviations from the line 12.5 0.5y x would be larger

than the sum of the squared vertical deviations from the least squares regression line ˆ 12 0.36y x because, by definition, the least squares regression line is the line with the

minimum value for the sum of the squared vertical deviations from the line. All other lines would have larger values for the sum of the squared vertical deviations.

4.23: (a) $\hat{y} = -5.0 + 0.017x$, where y represents the predicted amount of natural gas used and x represents the size of a house. (b) $\hat{y} = -5.0 + 0.017(2100) = 30.7$ therms. (c) This is the slope of the line, 0.017. (d) No, because the regression line was determined based on house sizes between 1000 and 3000 square feet. There is no guarantee that the linear relationship will continue outside this range of house sizes. This is the danger with extrapolation.
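Parts (b) and (d) can be combined in a small sketch: a prediction function that refuses to extrapolate. It assumes the line has intercept -5.0 and slope 0.017 (signs inferred from the worked value of 30.7 therms) and the fitted range of 1000 to 3000 square feet stated in part (d). The function name and the 4000 square foot test value are hypothetical.

```python
def predict_gas_use(house_size_sqft, low=1000, high=3000):
    """Predicted natural gas use (therms) from the line y-hat = -5.0 + 0.017x.

    Raises ValueError outside the range of house sizes used to fit the line.
    """
    if not low <= house_size_sqft <= high:
        raise ValueError("house size is outside the fitted range; extrapolation is unreliable")
    return -5.0 + 0.017 * house_size_sqft

prediction = predict_gas_use(2100)
print(round(prediction, 1))

# A hypothetical 4000 square foot house falls outside the fitted range.
try:
    predict_gas_use(4000)
    extrapolation_blocked = False
except ValueError:
    extrapolation_blocked = True
print(extrapolation_blocked)
```

Raising an error is one design choice among several; returning the prediction with a warning would also be reasonable.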

4.24: (a) The response variable (y) is birth weight, and the predictor variable (x) is mother’s age.

(b) It is reasonable to use a line to summarize the relationship between birth weight and mother’s age. There is a clear linear relationship between birth weight and mother’s age.

[Scatterplot: Birth Weight (grams) (vertical axis) versus Mother's Age (years) (horizontal axis)]

(c) $\hat{y} = -1163.4 + 245.15x$, where y represents the birth weight in grams, and x represents the mother’s age in years.

(d) The slope of 245.15 is the amount, on average, by which the birth weight increases when the mother’s age increases by one year.

(e) It is not appropriate to interpret the intercept of the least squares regression line. The intercept is the birth weight for a mother who is zero years old, which is impossible. In addition, the intercept is negative, indicating a negative birth weight, which is also impossible.

(f) $\hat{y} = -1163.4 + 245.15(18) = 3249.3$ grams

(g) $\hat{y} = -1163.4 + 245.15(15) = 2513.85$ grams

(h) It is not advisable to use the least squares regression line to predict the birth weight of a baby born to a 23-year-old mother because 23 years is well outside the range of ages used in determining the least squares regression line. There is no guarantee that the linear relationship between birth weight and mother’s age will continue outside the range of 15 to 19 years.



4.25: (a) The response variable is the cost of medical care, and the predictor variable is the measure of pollution.

(b)

[Scatterplot: Cost of Medical Care Per Person (vertical axis) versus Pollution in micrograms/cubic meter of air (horizontal axis)]

There is a moderate, negative association (r = –0.581) between pollution and cost of medical care per person.

(c) $\hat{y} = 1082.2 - 4.691x$, where y is the cost of medical care per person, and x is the pollution measure.

(d) The slope is negative, and it is consistent with the observed negative association in the scatterplot, as well as with the negative correlation coefficient.

(e) No, the scatterplot and least squares regression line equation do not support the researcher’s conclusion that people over the age of 65 who live in more polluted areas have higher medical costs. The association between medical cost and pollution level is negative, which indicates that people over the age of 65 in more polluted areas tend to have lower medical costs.

(f) $\hat{y} = 1082.2 - 4.691(35) = \$918.02$

(g) No, I would not use the least squares regression equation to predict the cost of medical care for a region that has a pollution measure of 60. The data set that was used to create the least squares regression line was based on pollution measures that vary between 26.8 and 40; the value 60 is far outside that range of values. There is no guarantee that the observed trend will continue as far as 60. This is a danger of extrapolation.

4.26: (a) The scatterplot (shown below) shows a strong linear relationship between the air flow readings from the mini-Wright and Wright meters. The correlation coefficient (r = 0.943) supports the observation made from the scatterplot. Therefore, we can confidently use the data to find the equation of the least squares regression line. That equation is $\hat{y} = 11.48 + 0.970x$, where y is the predicted Wright meter reading and x is the mini-Wright meter reading.

[Scatterplot: Wright Meter (vertical axis) versus Mini-Wright Meter (horizontal axis)]

(b) The predicted Wright meter reading for a mini-Wright meter reading of 500 is $\hat{y} = 11.48 + 0.970(500) = 496.48$.

(c) The predicted Wright meter reading for a mini-Wright meter reading of 300 is $\hat{y} = 11.48 + 0.970(300) = 302.48$.


4.27: It makes sense to use the least squares regression line to summarize the relationship between x and y in scatterplot 1 but not for scatterplot 2. Scatterplot 1 shows an approximately linear relationship between x and y, whereas scatterplot 2 shows a clearly nonlinear relationship between x and y.

4.28: The regression line is called the least squares line because the equation of the line is found by minimizing the sum of the squared vertical deviations from the line, hence the phrase least squares.
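The minimizing property can be illustrated with a rough sketch. It assumes the five (age, median distance) pairs tabulated in Exercise 4.42 as sample data, fits the least squares line from the standard formulas, and compares its sum of squared vertical deviations with that of several competing lines. The helper names and the sizes of the nudges are hypothetical.

```python
from itertools import product

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line for the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def sum_squared_deviations(intercept, slope, xs, ys):
    """Sum of squared vertical deviations of the points from a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# (age, median distance) pairs from the table in Exercise 4.42
ages = [4.0, 7.0, 10.0, 13.5, 17.0]
distances = [544.3, 584.0, 667.3, 701.1, 727.6]

a, b = fit_line(ages, distances)
best = sum_squared_deviations(a, b, ages, distances)

# Hypothetical competing lines: nudge the intercept and/or the slope.
nudges = [(da, db) for da, db in product((-5, 0, 5), (-1, 0, 1)) if (da, db) != (0, 0)]
others = [sum_squared_deviations(a + da, b + db, ages, distances) for da, db in nudges]

print(round(a, 2), round(b, 3))
print(all(best < other for other in others))
```

Every competing line tried here has a larger sum of squared deviations than the fitted line, which is what "least squares" means.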



4.29: (a)

[Scatterplot: Volume of Cerebral Grey Matter (mL) (vertical axis) versus Head Circumference z-score (horizontal axis)]

(b) r = 0.786

(c) $\hat{y} = 714.15 + 42.52x$, where y is the predicted volume of cerebral grey matter and x is the head circumference z-score.

(d) The predicted volume of cerebral grey matter for a child whose head circumference z-score is 1.8 is $\hat{y} = 714.15 + 42.52(1.8) = 790.686$.

(e) The head circumference z-scores in the data set range between –0.75 and 2.80. Since the head circumference z-score of 3.0 is outside this range, the least squares regression line should not be used to predict the volume of cerebral grey matter because there is no guarantee that the linear pattern will continue outside this range. This is the danger of extrapolation.



4.30: (a) Net directionality is the response variable (y), and mean temperature is the predictor variable (x).

(b)

[Scatterplot: Net Directionality (vertical axis) versus Mean Temperature (horizontal axis)]

There is a moderately strong, positive linear relationship between net directionality and mean temperature. It should be noted that there is one point, (8.06, 0.25), that appears to be an outlier that does not follow the general trend of the data. If that point is removed, the relationship becomes stronger.

(c) $\hat{y} = -0.14282 + 0.016141x$, where y is the predicted net directionality and x is the mean temperature.

(d) For a 1°C increase in mean temperature, the predicted net directionality increases by 0.0161.

(e) The intercept of the least squares regression line is the estimated net directionality at a mean temperature of 0°C. Since the mean temperatures in the data set used to compute the least squares regression line range between 6.17°C and 20.25°C, and 0°C is outside this range, it would not be wise to interpret the intercept as the predicted net directionality when the mean temperature is 0°C.

(f) The predicted net directionality for a mean water temperature of 15°C is $\hat{y} = -0.14282 + 0.016141(15) = 0.099295$.

(g) Yes, the scatterplot and least squares regression line both support this statement. Because the scatterplot has a positive association, higher values of mean temperature tend to be associated with higher values for net directionality. The slope of the regression line is positive, which also indicates that larger values for mean temperature tend to be associated with higher values for net directionality.

4.31: (a)

[Scatterplot: Percentage of Operating Costs Covered (vertical axis) versus Number of Visitors in thousands (horizontal axis)]

There is a moderately strong positive relationship between number of visitors and percentage of operating costs covered by park revenues.

(b) $\hat{y} = 28.723 + 0.00000422x$, where y is the predicted percentage of operating costs covered, and x is the number of visitors.

(c) The slope is positive, which is consistent with the description in part (a).

(d) The correlation coefficient would be greater than 0.5, because the relationship is moderately strong.

4.32: The scatterplot (shown below) indicates that there is a linear relationship between the J.D. Power Quality Score and the Passenger Complaint Rank. Therefore, it is appropriate to find the least squares regression equation and use it to predict the J.D. Power Quality Scores for the two airlines not scored by J.D. Power. The least squares regression equation is $\hat{y} = 0.884 + 0.6994x$, where y is the predicted J.D. Power Quality Score, and x is the Passenger Complaint Rank. The least squares regression line is also shown on the scatterplot. The predicted J.D. Power Quality Score for Hawaiian Airlines, which has a passenger complaint rank of 3, is $\hat{y} = 0.884 + 0.6994(3) = 2.9822$. The predicted J.D. Power Quality Score for Northwest Airlines, which has a passenger complaint rank of 9, is $\hat{y} = 0.884 + 0.6994(9) = 7.1786$.

[Scatterplot with least squares regression line: J.D. Power Quality Score (vertical axis) versus Passenger Complaint Rank (horizontal axis)]

Additional Exercises for Section 4.2

4.33: $\hat{y} = 13.5 - 0.195x$, where y is the predicted number of cell phone calls, and x is the age.

4.34: $\hat{y} = 22.4 - 0.234x$, where y is the predicted number of text messages sent, and x is the age.

4.35: Scatterplot of cell phone calls versus age:

[Scatterplot: Number of Cell Phone Calls (vertical axis) versus Age (horizontal axis)]


Scatterplot of text messages sent versus age:

[Scatterplot: Number of Text Messages Sent (vertical axis) versus Age (horizontal axis)]

Age is a better predictor of number of cell phone calls. The linear relationship between age and number of cell phone calls is stronger than the relationship between age and number of text messages sent.

4.36: (a) The response variable is the number of fruit and vegetable servings per day, and the predictor variable is the number of hours of television viewed per day. (b) The least squares regression line would have a negative slope because the number of fruit and vegetable servings decreases by 0.14 servings for each additional hour of television viewed.

4.37: The slope is the change in predicted price for each additional mile from the Bay, so the slope would be –4,000.

4.38: (a) $\hat{y} = 59.91 + 27.462x$, where y is the predicted runoff sediment concentration and x is the percentage of bare ground. (b) $\hat{y} = 59.91 + 27.462(18) = 554.226$ (c) No, I would not recommend using the least squares regression line from part (a) to predict runoff sediment concentration for gradually sloped plots because that regression line was based on data from steeply sloped plots, not gradually sloped plots. There is no reason to believe that both gradually and steeply sloped plots would have the same relationship between runoff sediment concentration and percentage of bare ground.




4.39: (a) $\hat{y} = 1.33878 - 0.007661x$, where y is the predicted telomere length, and x is the perceived stress. (b) $r^2 = 0.099$; approximately 9.9% of the variability in telomere length can be explained by the linear relationship between telomere length and perceived stress.

(c) $s_e = 0.159119$ (obtained using software); $s_e$ tells us that telomere length will deviate from the least squares regression line by 0.159119, on average. (d) The relationship between perceived stress and telomere length is weak and negative. It is negative because the slope of the least squares regression line is negative, and it is weak because the correlation coefficient (which has the same sign as the slope) is $r = -\sqrt{r^2} = -\sqrt{0.099} = -0.315$.
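The step from r² back to r in part (d) can be sketched as follows, assuming r² = 0.099 and a negative slope of -0.007661 as read from parts (a) and (b). The function name is hypothetical.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Correlation coefficient: magnitude sqrt(r^2), sign taken from the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

r = r_from_r_squared(0.099, -0.007661)
print(round(r, 3))  # prints -0.315
```

The square root alone gives only the magnitude; the sign has to come from the slope of the least squares line.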

4.40: A small value of $s_e$ indicates that residuals tend to be small. Because residuals represent the difference between an observed y value and a predicted y value, the value of $s_e$ tells us how much accuracy we can expect when using the least squares regression line to make predictions.

4.41: It is important to consider both $r^2$ and $s_e$ when evaluating the usefulness of the least squares regression line because a large $r^2$ (which indicates the proportion of variability in y that can be explained by the linear relationship between x and y) tells us that knowing the value of x is helpful in predicting y, and a small $s_e$ indicates that residuals tend to be small, thus the data points tend to be relatively close to the regression line. These are both desirable when assessing the usefulness of the regression line.
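Both quantities can be computed from a fitted line. The sketch below assumes the six (pollution, cost of medical care) pairs tabulated in Exercise 4.51 as sample data, and uses the usual definitions: r² = 1 - SSResid/SSTotal and s_e = sqrt(SSResid/(n - 2)). The variable names are illustrative.

```python
import math

# (pollution, cost of medical care) pairs from the table in Exercise 4.51
pollution = [30.0, 31.8, 32.1, 26.8, 30.4, 40.0]
cost = [915, 891, 968, 972, 952, 899]

n = len(pollution)
x_bar = sum(pollution) / n
y_bar = sum(cost) / n
s_xx = sum((x - x_bar) ** 2 for x in pollution)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pollution, cost))

# Least squares slope and intercept
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual and total sums of squares
ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(pollution, cost))
ss_total = sum((y - y_bar) ** 2 for y in cost)

r_squared = 1 - ss_resid / ss_total
s_e = math.sqrt(ss_resid / (n - 2))
print(round(slope, 3), round(r_squared, 3), round(s_e, 2))
```

Here r² is a proportion with no units, while s_e is in the units of y, which is why the two summaries complement each other.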



4.42: (a) Yes, the scatterplot looks reasonably linear.

[Scatterplot: Median Distance (meters) (vertical axis) versus Representative Age (horizontal axis)]

(b) $\hat{y} = 492.80 + 14.763x$, where y is the predicted median distance walked, and x is the representative age.

(c) The residuals are shown in the last column of the table below. The predicted y ($\hat{y}$) values were found using the least squares regression equation.

Representative Age (x)   Median Distance Walked (y)   Predicted y ($\hat{y}$)   Residual ($y - \hat{y}$)
4.0                      544.3                        551.851                   -7.5510
7.0                      584.0                        596.141                   -12.1410
10.0                     667.3                        640.431                   26.8690
13.5                     701.1                        692.103                   8.9974
17.0                     727.6                        743.774                   -16.1743

[Residual plot: Residual (vertical axis) versus Representative Age (horizontal axis)]

Yes, there is an unusual feature in the plot, namely, there is a curved pattern in the residual plot. The curvature indicates that the relationship between median distance walked and representative age is not linear.
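The curved pattern can also be seen from the signs of the residuals. This is a minimal sketch, assuming the five observations from the table in part (c) and the rounded coefficients 492.80 and 14.763 from part (b); because of that rounding, the computed residuals differ from the tabulated ones in the third decimal place.

```python
# Observations from the table in part (c)
ages = [4.0, 7.0, 10.0, 13.5, 17.0]
distances = [544.3, 584.0, 667.3, 701.1, 727.6]

# Rounded coefficients of the least squares line from part (b)
intercept, slope = 492.80, 14.763

# Residual = observed y minus predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(ages, distances)]
signs = ["+" if e > 0 else "-" for e in residuals]

for age, e, sign in zip(ages, residuals, signs):
    print(age, round(e, 3), sign)
```

Negative residuals at both ends and positive residuals in the middle are the numerical counterpart of the arch seen in the residual plot.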

4.43: (a)

[Scatterplot: Median Distance (meters) (vertical axis) versus Representative Age (horizontal axis)]

The pattern for girls differs from boys in that the girls’ scatterplot shows more apparent nonlinearity in the pattern.



(b) $\hat{y} = 480 + 12.525x$, where y is the girls’ predicted distance walked, and x is the representative age.

(c)

Representative Age (x)   Median Distance Walked (y)   Predicted y ($\hat{y}$)   Residual ($y - \hat{y}$)
4.0                      492.4                        530.095                   -37.6952
7.0                      578.3                        567.669                   10.6311
10.0                     655.8                        605.243                   50.5574
13.5                     657.6                        649.079                   8.5214
17.0                     660.9                        692.915                   -32.0147

[Residual plot: Residual (vertical axis) versus Representative Age (horizontal axis)]

The curvature of the residual plot indicates that a curve is more appropriate than a line for describing the relationship between median distance walked and representative age.



4.44: (a)

[Scatterplot: Nitrogen Retention (grams) (vertical axis) versus Nitrogen Intake (grams) (horizontal axis)]

(b) $\hat{y} = -0.03443 + 0.5803x$, where y is the predicted nitrogen retention, and x is the nitrogen intake (both in grams). The predicted nitrogen retention for a flying squirrel whose nitrogen intake is 0.06 grams is $\hat{y} = -0.03443 + 0.5803(0.06) = 0.000388$ grams.

The residual associated with the observation (0.06, 0.01) is $y - \hat{y} = 0.01 - 0.000388 = 0.009612$.

(c) The observation (0.25, 0.11) is potentially influential because that point has an x value that is far away from the rest of the data set.

(d) $\hat{y} = -0.037 + 0.627(0.06) = 0.00062$; this prediction is larger than the prediction made in part (b).



4.45: (a)

[Scatterplot: Percentage Transported (vertical axis) versus Total Number of Salmon (horizontal axis)]

There appears to be a linear relationship between Percentage Transported and Total Number of Salmon.

(b) $\hat{y} = 18.483 + 0.0028655x$, where y is the predicted percentage transported and x is the total number of salmon.

(c) The observation (3928, 46.8) is not influential, because the x-value for that observation is not far from the rest of the data. In addition, removal of the potentially influential point produces a least squares regression line with a y-intercept and slope similar to the original line.

(d) Those points are not considered influential even though they are far from the rest of the data because they follow the trend of the remaining data points. Removal of those potentially influential points would produce a least squares regression line similar to the line found using the full data set.

(e) $s_e = 9.16217$; observations tend to deviate from the least squares regression line by 9.16217 percentage points on average.

(f) $r^2 = 0.832$; approximately 83.2% of the variability in percentage transported can be explained by the linear relationship between percentage transported and number of salmon.




4.46: (a) $\hat{y} = 232.26 - 2.926x$, where y is the predicted algae colony density and x is the rock surface area.

(b) $r^2 = 0.300$; 30.0% of the variability in algae colony density can be explained by the approximate linear relationship between algae colony density and rock surface area.

(c) $s_e = 63.3153$; observations of algae colony density tend to deviate from the values predicted by the least squares regression line by 63.3153, on average.

(d) The relationship between rock surface area and algae colony density is negative because the slope of the regression line is negative. The relationship is moderate because $r = -\sqrt{r^2} = -\sqrt{0.30} = -0.548$.

4.47: A large value of $r^2$ indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y, which is desirable. This tells us that knowing the value of x is helpful for predicting y.

4.48: It is important to consider both $r^2$ and $s_e$ when evaluating the usefulness of the least






4.49: (a) Yes, the pattern in the scatterplot looks linear.

[Scatterplot: Shear Strength (vertical axis) versus Weld Diameter (horizontal axis)]

(b) The regression equation is $\hat{y} = -941.699 + 8.599x$, where y is the predicted shear strength in pounds and x is the weld diameter.

(c)

Weld Diameter (x)   Shear Strength (y)   Predicted Shear Strength ($\hat{y}$)   Residual ($y - \hat{y}$)
200.1               813.7                778.9609                               34.7391
210.1               785.3                864.9509                               -79.6509
220.1               960.4                950.9409                               9.4591
230.1               1118.0               1036.931                               81.0691
240.0               1076.2               1122.061                               -45.861

[Residual plot: Residuals (vertical axis) versus Weld Diameter (horizontal axis)]

The residual plot shows no pattern that would call into question the appropriateness of using a linear model to describe the relationship between weld diameter and shear strength.

4.50: (a) $\hat{y} = 227.969 + 0.105x$, where y is the predicted average finish time and x is the representative age.

(b)

Representative Age (x)   Average Finish Time (y)   Predicted Average Finish Time ($\hat{y}$)   Residual ($y - \hat{y}$)
15                       302.38                    229.544                                     72.836
25                       193.63                    230.594                                     -36.964
35                       185.46                    231.644                                     -46.184
45                       198.49                    232.694                                     -34.204
55                       224.30                    233.744                                     -9.444
65                       288.71                    234.794                                     53.916

[Residual plot: Residual (vertical axis) versus Representative Age (horizontal axis)]

The obvious curved pattern of points in the residual plot supports the decision to fit a curve rather than a straight line to describe the relationship between average finish time and representative age.

4.51: (a)

Pollution (x)   Cost of Medical Care (y)   Predicted cost ($\hat{y}$)   Residuals ($y - \hat{y}$)
30.0            915                        941.470                      -26.4700
31.8            891                        933.026                      -42.0262
32.1            968                        931.619                      36.3811
26.8            972                        956.481                      15.5188
30.4            952                        939.594                      12.4064
40.0            899                        894.560                      4.4400

(b) The correlation coefficient is $r = -0.581$. This value indicates that the linear relationship between pollution and medical cost is moderate, because the correlation coefficient lies between -0.8 and -0.5.

(c)

[Residual plot: Residual (vertical axis) versus Pollution (horizontal axis)]

The residual plot shows one point, (40.0, 4.4), relatively far removed from the others. There is no obvious pattern in the residual plot.

(d) The point (40.0, 899) is a potentially influential point because the x-value of that point is far away from the rest of the data. However, after removing that point, we find that the slope and intercept do not change much. The slope and intercept for the full data set were given as -4.691 and 1082.2, respectively. After the potentially influential point is removed, the slope and intercept become -7.107 and 1154.4, respectively. Given the scale on the y-axis (cost of medical care), these values do not seem to have changed much.
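The effect of removing the point can be reproduced with a short sketch. It assumes the six observations tabulated in part (a) and refits the least squares line with and without the observation at x = 40.0; the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line for the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Observations from the table in part (a)
pollution = [30.0, 31.8, 32.1, 26.8, 30.4, 40.0]
cost = [915, 891, 968, 972, 952, 899]

full_intercept, full_slope = fit_line(pollution, cost)

# Drop the potentially influential observation (40.0, 899) and refit.
kept = [(x, y) for x, y in zip(pollution, cost) if x != 40.0]
reduced_intercept, reduced_slope = fit_line([x for x, _ in kept], [y for _, y in kept])

print(round(full_intercept, 1), round(full_slope, 3))
print(round(reduced_intercept, 1), round(reduced_slope, 3))
```

Comparing the two fitted lines side by side is the usual way to judge whether a point with an extreme x value is actually influential.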

4.52: (a) $r^2 = 0.154$

(b) No, predictions based on a least-squares line with y = first-year college GPA and x = SAT II score would not be very accurate. Given that $r^2 = 0.16$, only 16% of the variation in first-year college GPA can be explained by the linear relationship between first-year college GPA and SAT II score. Since this value is small, it tells us that knowing the value of the SAT II score is not particularly helpful for predicting first-year college GPA.



4.53: (a) $\hat{y} = 4788.84 - 29.015x$, where y is the predicted average swim distance and x is the representative age.

(b)

Representative Age (x)   Average Swim Distance (y)   Predicted Average Swim Distance ($\hat{y}$)   Residual ($y - \hat{y}$)
25                       3913.5                      4063.47                                       -149.965
35                       3728.8                      3773.32                                       -44.515
45                       3579.4                      3483.17                                       96.235
55                       3361.9                      3193.02                                       168.885
65                       3000.1                      2902.86                                       97.235
75                       2649.0                      2612.72                                       36.285
85                       2118.4                      2322.57                                       -204.165

[Residual plot: Residual (vertical axis) versus Representative Age (horizontal axis)]

The curved pattern in the residual plot indicates that a line is not appropriate for describing the relationship between representative age and average swim distance.

(c) It would not be appropriate to use the least squares regression line to predict the average swim distance for women ages 40 to 49 by substituting the representative age of 45 into the equation because the equation was computed based on men’s swim distances and ages. There is no reason to expect that women would have the same relationship as men.



4.54: (a)

[Scatterplot: Number of Employees (vertical axis) versus Park Size (acres) (horizontal axis)]

(b) The equation of the least squares regression line is $\hat{y} = 85.3345 - 0.00002589x$, where y is the predicted number of employees and x is the park size. This line would not give accurate predictions because the value of $r^2$ is 0.016. This $r^2$ tells us that only 1.6% of the variation in number of employees can be explained by the linear relationship between park size and number of employees.

(c) Yes, the observation (620,231, 67) does have a big effect on the equation of the line. After deleting the observation, the least squares regression equation becomes $\hat{y} = 83.4018 + 0.0000387x$. Note that the slope of the regression equation changed from negative to positive.

4.55: (a) $\hat{y} = 27.7092 + 0.2011x$, where y is the predicted percent of operating cost covered by revenues and x is the number of employees.

(b) Using software, we find that $r^2 = 0.053$ and $s_e = 24.1475$. Therefore, only 5.3% of the variation in percent of operating cost covered by revenues can be explained by the linear relationship between these variables. In addition, the value of $s_e$ is nearly the same size as many of the y values, so the prediction errors are nearly as large as the values themselves. The values of $r^2$ and $s_e$ indicate that the least squares regression line does not do a good job describing the relationship between these two variables.

(c) The points (69, 80), (87, 97), and (112, 108) are outliers. No, the observations with the largest residuals do not correspond to the parks with the largest number of employees.



4.56:

[Scatterplot: Moisture Content (%) (vertical axis) versus Frying Time (seconds) (horizontal axis)]

The scatterplot shows an obvious nonlinear relationship between moisture content and frying time, so the least squares regression line would not be appropriate to summarize the relationship between these variables.

Section 4.5: Exercise Set 1

4.57: (a)

[Scatterplot: Moisture Content (%) (vertical axis) versus Frying Time (seconds) (horizontal axis)]


The scatterplot shows an obvious nonlinear relationship between moisture content and frying time, so a straight line would not provide an effective summary of the relationship between these variables.

(b)

[Scatterplot: log(Moisture Content) (vertical axis) versus log(Frying Time) (horizontal axis)]

The scatterplot of the transformed data shows a strong linear relationship between log(Frying Time) and log(Moisture Content).

(c) Yes, the least squares line effectively summarizes the linear relationship between x' and y' due to the high $r^2$ value.

(d) The equation of the least squares regression line is $\hat{y}' = 2.02 - 1.05x'$, where y' represents the log(Moisture Content) and x' is the log(Frying Time). Therefore, the predicted log(Moisture Content) when the frying time is 35 seconds is $\hat{y}' = 2.02 - 1.05\log(35) = 0.3987$, which yields a Moisture Content of $10^{0.3987} = 2.50$%.
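The back-transformation in part (d) can be sketched as follows. It assumes base-10 logarithms (consistent with the final step of raising 10 to the predicted value) and a fitted line with intercept 2.02 and slope -1.05, the sign being inferred from the worked value 0.3987. The function name is hypothetical.

```python
import math

def predict_moisture(frying_time_seconds):
    """Predict on the log scale, then undo the log to return to percent moisture."""
    log_moisture = 2.02 - 1.05 * math.log10(frying_time_seconds)
    return log_moisture, 10 ** log_moisture

log_pred, moisture_pred = predict_moisture(35)
print(round(log_pred, 4), round(moisture_pred, 2))
```

The regression is fitted on the transformed scale, so the prediction must be converted back before it is reported in the original units.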



4.58: (a)

[Scatterplot: Percentage of Eagles in the Air vs. Salmon Availability]

There is a curved relationship between salmon availability and percentage of eagles in the air.

(b)

[Scatterplot: sqrt(Percentage of Eagles in the Air) vs. sqrt(Salmon Availability)]

Yes, it appears that the pattern in the scatterplot is straighter than the pattern in the scatterplot in Part (a).



(c) Taking the square root of both variables provides a more linear relationship that still has some residual curvature. It is possible that taking the cube roots of both variables might straighten the pattern in the scatterplot even more.

[Scatterplot: cube root of Percentage of Eagles in Air vs. cube root of Salmon Availability]

The cube root transformation seems to result in a more linear relationship.

4.59: The reciprocal transformation in scatterplot (c) seems to capture the overall trend of the data better than the other transformation.



Section 4.5: Exercise Set 2

4.60: (a)

[Scatterplot: Success (%) vs. Energy of Shock]

The relationship appears to be nonlinear.

(b) The equation of the least squares regression line is $\hat{y} = 22.48 + 34.36x$, where $\hat{y}$ is the predicted success percentage and x is the energy of shock. The table below shows the original observations, the predicted success %, and the residuals.

Energy of Shock (x)   Success % (y)   Predicted Success % ($\hat{y}$)   Residual ($y - \hat{y}$)
0.5                   33.3            39.66                             -6.36
1.0                   58.3            56.84                              1.46
1.5                   81.8            74.02                              7.78
2.0                   96.7            91.20                              5.50
2.5                   100.0           108.38                            -8.38

The obvious curved pattern in the residual plot (shown below) supports the conclusion in Part (a).

[Residual plot: Residual vs. Energy of Shock]
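As a check on Part (b), the least squares line and its residuals can be recomputed from the five observations in the table. The sketch below uses the standard slope and intercept formulas in plain Python; the variable names are illustrative.

```python
# Data from the table in Part (b): energy of shock (x) and success % (y).
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [33.3, 58.3, 81.8, 96.7, 100.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept from the usual formulas.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"{xi:4.1f} {yi:6.1f} {pi:7.2f} {ri:6.2f}")
```

The residuals are negative at both ends of the x range and positive in the middle, which is the curved pattern described in Part (b).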

(c) The scatterplot of $x' = \sqrt{x}$ vs. y is shown below:

[Scatterplot: Success (%) vs. sqrt(Energy of Shock)]

The scatterplot of $x' = \ln(x)$ vs. y is shown below:

[Scatterplot: Success (%) vs. ln(Energy of Shock)]

The residual plots for the two transformations are shown below.

[Residual plot: Residual vs. sqrt(Energy of Shock)]

[Residual plot: Residual vs. ln(Energy of Shock)]

Both transformations seem to create linear patterns in the scatterplots. The residual plot for the square root transformation still shows a curved pattern, but the residual plot for the log transformation does not have as pronounced a pattern. Therefore, the second transformation (ln(x) vs. y) seems to be somewhat more successful in straightening the data.

(d) The equation of the least squares regression line is $\hat{y} = 62.42 + 43.81x'$, where $\hat{y}$ is the predicted success % and $x'$ is the natural logarithm of the energy of shock.

(e) When x = 1.75, $\hat{y} = 62.42 + 43.81\ln(1.75) = 86.94\%$.

When x = 0.8, $\hat{y} = 62.42 + 43.81\ln(0.8) = 52.64\%$.
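The two predictions in Part (e) can be reproduced with a small sketch that plugs the energy of shock into the fitted equation from Part (d). The coefficients are taken as given in the solution; the function name is illustrative.

```python
import math

def predict_success(energy):
    """Predicted success % from the fit on x' = ln(energy of shock)."""
    return 62.42 + 43.81 * math.log(energy)   # math.log is the natural log

for energy in (1.75, 0.8):
    print(energy, round(predict_success(energy), 2))
```

Note that the logarithm of 0.8 is negative, so the second prediction falls below the intercept of 62.42.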



4.61: (a)

[Scatterplot: Number waiting for transplants (thousands) vs. Year]

The number of people waiting for transplants increased at a nonlinear rate from 1990 to 1999 (the number of people waiting for transplants increased by greater amounts each year).

(b) A transformation that straightens the scatterplot is (year, log(number waiting)). The scatterplot is shown below.

[Scatterplot: log(Number Waiting) vs. Year]


(c) The least squares regression equation is $\hat{y} = 1.287 + 0.05797x$, where $\hat{y}$ is the predicted log of the number waiting and x is the year (where 1 = 1990, 2 = 1991, etc.). In 2000, the predicted log of the number waiting is $\hat{y} = 1.287 + 0.05797(11) = 1.92467$, so the predicted number of people waiting in 2000 is $10^{1.92467} \approx 84.076$ thousand, or approximately 84,076.
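The back-transformation in Part (c) can be sketched in Python as follows. The sketch assumes a base-10 logarithm, consistent with raising 10 to the predicted value; the function name is illustrative.

```python
def predict_waiting_thousands(year_code):
    """Predicted number waiting (thousands); year_code 1 = 1990, 11 = 2000."""
    log_pred = 1.287 + 0.05797 * year_code   # predicted log10(number waiting)
    return log_pred, 10 ** log_pred          # back-transform to thousands

log_2000, waiting_2000 = predict_waiting_thousands(11)
print(round(log_2000, 5), round(waiting_2000, 1))
```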

(d) We must be confident that the pattern observed from 1990 to 1999 continues into 2000. Given that 2000 is just one year outside the range of years used to develop the least squares regression model, that the linear relationship is strong (r = 0.999), that the coefficient of determination is very close to 100% ($r^2 = 99.9\%$), and that the standard deviation of the residuals is quite small ($s_e = 0.0058$), we can reasonably extrapolate to the year 2000 (year = 11 in the model). However, there is no guarantee that the relationship will continue to the year 2010, so it is unwise to use the least squares regression model to predict the number waiting for a transplant in 2010.

Section 4.5: Additional Exercises

4.62: (a)

[Scatterplot: Agricultural Intensity vs. Population Density]

Yes, the scatterplot shows a positive relationship between agricultural intensity and population density, and so it is compatible with the statement of a positive relationship.



(b)

[Scatterplot: Agricultural Intensity vs. (Population Density)^2]

Yes, the squaring transformation straightens the plot. The variability in Agricultural Intensity seems to increase as (Population Density)^2 increases.

(c)

[Scatterplot: log(Agricultural Intensity) vs. Population Density]

This transformation also straightens the plot. In addition, the variability in log(Agricultural Intensity) does not increase as population density increases, as was the case in part (b).



(d)

[Scatterplot: log(Agricultural Intensity) vs. (Population Density)^2]

These transformations were not effective in straightening the plot. There is obvious curvature in the scatterplot.

4.63:

[Scatterplot: canal length (millimeters) vs. age (years)]

The relationship between canal length and age is not linear. The transformations $y' = \log(\text{canal length})$ and $x' = \log(\text{age})$ help to straighten the pattern in the scatterplot.

[Scatterplot: log(canal length) vs. log(age)]

The residual plot shown below shows no obvious pattern, and therefore supports the assertion that the transformation succeeded in straightening the pattern in the plot.

[Residual plot: Residual vs. log(age)]
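To illustrate why taking logs of both variables can straighten a curved pattern, the sketch below compares the correlation before and after the transformation. The data are hypothetical values generated from an exact power relationship and are not the canal length data from the exercise; the helper function is also illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages (years) and lengths following y = 200 * x**0.6 exactly.
# These are NOT the textbook's canal length data.
age = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
length = [200 * a ** 0.6 for a in age]

r_raw = pearson_r(age, length)
r_loglog = pearson_r([math.log10(a) for a in age],
                     [math.log10(v) for v in length])
print(round(r_raw, 4), round(r_loglog, 4))
```

Because log(y) = log(200) + 0.6 log(x) for these made-up values, the transformed points fall on a straight line. Real data would show scatter around the line, which is why the residual plot is still worth checking.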

Chapter 4: Are You Ready to Move On?

4.64: Scatterplot 1: (i) Yes, (ii) Yes, (iii) Negative






4.65: (a) Positive, because the cost of produce is determined by its weight, so a heavier apple would cost more. (b) Close to zero, because there is no reason to believe that the height of a person would be at all associated with the number of pets a person has. (c) Positive, because, in general, the more time spent studying for an exam the higher the score will be. (d) Positive, because, in general, the heavier a person is the longer it will take to run 1 mile.



4.66: (a) r = -0.10; there is a weak, negative linear relationship. Because the relationship is negative, larger arch heights tend to be paired with smaller average hopping heights.

(b) The correlation coefficients support the conclusion since they are all fairly close to 0.

4.67: (a)

[Scatterplot: Fiber (grams per serving) vs. Cost (cents per serving)]

There is no apparent relationship between fiber and cost.

(b) r = 0.204; there is a weak positive linear relationship between fiber and cost.

(c) r = 0.241; the correlation coefficient for the per cup data is greater than the correlation coefficient for the per serving data.

4.68: (a) r = 0.944; there is a strong positive linear relationship between sugar consumption and depression rate.

(b) No, because you can’t conclude that a cause-and-effect relationship exists just based on a strong correlation.

(c) These countries were not a random sample of all countries, and it is unlikely that they are representative.



4.69: (a)

[Scatterplot: Cost of Medical Care Per Person vs. Pollution (micrograms/cubic meter of air)]

(b) There is a moderate, negative association (r = –0.581) between pollution and cost of medical care per person.

(c) The scatterplot shows a negative association between cost of medical care per person and pollution level. The sign of the correlation coefficient supports the negative association between the variables. Both the scatterplot and correlation coefficient tell us that people over the age of 65 living in more polluted areas tend to have lower medical costs.

4.70: (a) Negative, because it is likely that the work is more stressful and less enjoyable for nurses with high patient-to-nurse ratios.

(b) Negative, because it is likely that the patients at hospitals where the patient-to-nurse ratio is high will not get as much individual attention and will be less satisfied with their care.

(c) Negative, because it is likely that the quality of patient care suffers when patient-to-nurse ratio is high.

4.71: The least squares regression line is the line found by minimizing the sum of the squared vertical deviations from the line. Therefore, the least squares regression line will have the smallest sum of the squared vertical deviations compared to any other line, making choice (i) (308.6) the correct choice.
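The minimizing property used in 4.71 can be illustrated numerically. The sketch below fits a least squares line to a small hypothetical data set (not the data from the exercise) and confirms that nudging the intercept or slope in any direction increases the sum of squared vertical deviations.

```python
def sse(xs, ys, intercept, slope):
    """Sum of squared vertical deviations of the points from a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Hypothetical data for illustration; not the data from the exercise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = sse(xs, ys, a, b)

# Any other line, such as these nudged ones, gives a larger sum.
others = [sse(xs, ys, a + da, b + db)
          for da in (-0.5, 0.0, 0.5) for db in (-0.2, 0.0, 0.2)
          if (da, db) != (0.0, 0.0)]
print(best, min(others))
```

This is the same reasoning that identifies the smallest listed sum as belonging to the least squares line.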



4.72: There is no evidence that the form of the relationship between x and y remains the same outside the range of the data.

4.73: (a) $\hat{y} = 101.33 - 9.296x$, where $\hat{y}$ is the predicted survival rate and x is the mean call-to-shock time. (b) The slope of –9.296 is the amount, on average, by which the survival rate percentage decreases when the mean call-to-shock time increases by one minute. (c) No, it does not make sense to interpret the intercept because the y-values represent survival rate as a percent, and the intercept is greater than 100%. (d) The predicted survival rate for a community with a mean call-to-shock time of 10 minutes is $\hat{y} = 101.33 - 9.296(10) = 8.37\%$.
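A brief sketch of the calculations in Parts (c) and (d), using the fitted equation from Part (a); the function name is illustrative.

```python
def predict_survival(call_to_shock_min):
    """Predicted survival rate (%) from mean call-to-shock time (minutes)."""
    return 101.33 - 9.296 * call_to_shock_min

print(round(predict_survival(10), 2))   # prediction asked for in Part (d)
print(predict_survival(0))              # the intercept, which exceeds 100%
```

Evaluating the line at a call-to-shock time of 0 returns the intercept, which is not a possible survival percentage; this is why Part (c) declines to interpret it.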

4.74: (a) Size, because the relationship between price and size (r = 0.700) is stronger than the relationship between price and land-to-building ratio (r = -0.332).

(b) $\hat{y} = 1.33 + 0.00525x$, where $\hat{y}$ is the predicted sale price and x is the size in thousands of square feet.

4.75: (a) Use the least squares regression line to predict the median hourly wage gain for the 15th year of schooling. $\hat{y} = -9.0714 + 1.57143x$, where $\hat{y}$ is the predicted median hourly wage gain and x is the number of years of schooling. Therefore, the predicted median hourly wage gain for the 15th year of schooling is $\hat{y} = -9.0714 + 1.57143(15) = 14.5\%$. (b) The residual for the 15th year of schooling is $y - \hat{y} = 14 - 14.5 = -0.5\%$. The prediction was 0.5% too high.
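The prediction and residual in 4.75 can be reproduced with the sketch below. The observed value of 14% is taken from the residual calculation in Part (b); the names are illustrative.

```python
def predict_wage_gain(years_of_schooling):
    """Predicted median hourly wage gain (%) for a given year of schooling."""
    return -9.0714 + 1.57143 * years_of_schooling

observed = 14.0                       # observed gain for the 15th year (%)
predicted = predict_wage_gain(15)
residual = observed - predicted       # residual = observed - predicted
print(round(predicted, 1), round(residual, 1))
```

A negative residual means the observed value lies below the line, so the line overpredicts.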

4.76: (a)

[Scatterplot: Number of transplants (thousands) vs. Year (1 = 1990, 2 = 1991, etc.)]

$\hat{y} = 14.2 + 0.790x$, where $\hat{y}$ is the predicted number of transplants (in thousands) and x is the year (using 1, 2, …, 10 to represent the years). The number of transplants has increased steadily over time, with a predicted increase of about 790 per year.



(b)

Year (x)   Number of Transplants (y)   Predicted Number of Transplants ($\hat{y}$)   Residual ($y - \hat{y}$)
1          15.0                        15.0036                                       -0.003636
2          15.7                        15.7939                                       -0.093939
3          16.1                        16.5842                                       -0.484242
4          17.6                        17.3745                                        0.225455
5          18.3                        18.1648                                        0.135152
6          19.4                        18.9552                                        0.444848
7          20.0                        19.7455                                        0.254545
8          20.3                        20.5358                                       -0.235758
9          21.4                        21.3261                                        0.073939
10         21.8                        22.1164                                       -0.316364

[Residual plot: Residual vs. Year (1 = 1990, 2 = 1991, etc.)]

The residual plot shows a curved pattern, suggesting that the relationship between year and number of transplants is nonlinear.
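The fit and residuals in 4.76 can be recomputed from the ten observations in the table. The sketch below uses the standard least squares formulas and also records the sign of each residual; the variable names are illustrative.

```python
years = list(range(1, 11))            # 1 = 1990, ..., 10 = 1999
transplants = [15.0, 15.7, 16.1, 17.6, 18.3, 19.4, 20.0, 20.3, 21.4, 21.8]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(transplants) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(years, transplants))
         / sum((x - x_bar) ** 2 for x in years))
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(years, transplants)]
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(round(intercept, 4), round(slope, 4))
print(signs)
```

The residuals are negative for the first three years, positive for the next four, and mostly negative at the end, which matches the curved pattern noted above.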



4.77: (a)

[Scatterplot: Age (years) vs. Carapace Length (mm)]

$\hat{y} = -1.0755 + 0.029105x$, where $\hat{y}$ is the predicted age (in years) and x is the carapace length (in mm). There is a positive nonlinear association between age and carapace length.


(b)

[Residual plot: Residual vs. Carapace Length (mm)]

There is a curved pattern in the residual plot that indicates that the relationship between age and carapace length would be better described by a curve rather than a line.

4.78: (a) No, because the value of outpatient cost-to-charge ratio (54) is not far from the other outpatient values in the data set.

(b) Yes, it is an outlier, because it has a large residual—it is far away from the least-squares regression line.

(c) No, because even though it has an outpatient value that is larger than the others in the data set, this data point is consistent with the linear pattern formed by the other data values.

(d) No, because it falls quite close to the least squares regression line. This data point has a small residual.



4.79: (a)

[Scatterplot: Average Cost (dollars) vs. Average Energy Density (calories per 100 grams)]

The relationship looks approximately linear, although there is an outlier at (390, 0.37) that does not seem to follow the linear relationship firmly established by the remaining data values.

(b) $\hat{y} = 0.27970 + 0.0004619x$, where $\hat{y}$ is the predicted average cost in dollars and x is the average energy density in calories per 100 grams.

(c)

[Residual plot: Residual vs. Average Energy Density]

There seems to be one data value (average energy density of 390) that has a particularly large residual when compared to the other residuals.

(d) $r^2 = 0.547$; approximately 54.7% of the variability in average cost is explained by the linear relationship between average cost and average energy density.

(e) It is informative to consider both $r^2$ and $s_e$ when evaluating the usefulness of the least squares regression line.

(f) $s_e = 0.05453$ (obtained using software); $s_e$ tells us that average cost will deviate from the least squares regression line by approximately 0.05453 dollars on average.

(g) Yes, it is appropriate to use the least squares regression line to predict average cost for a food group based on its average energy density. The value of $r^2$ is moderately high, and the value of $s_e$ is reasonably small, which are desirable traits when assessing the usefulness of the regression line.

(h) $\hat{y} = 0.27970 + 0.0004619(196) = \$0.37$



4.80: (a)

[Scatterplot: Fatality Rate vs. Driver Age]

The relationship between fatality rate and driver age does not look linear.

(b) A quadratic model looks as if it would be appropriate to describe the relationship between fatality rate and driver age.

(c) The quadratic model suggested in part (b) is $\hat{y} = 2.360 - 0.08182x + 0.001009x^2$, where $\hat{y}$ is the predicted fatality rate and x is the driver age. The scatterplot of the data, overlaid with the quadratic model, is shown below.

[Scatterplot with fitted quadratic model: Fatality Rate vs. Driver Age]


(d) Yes, this model is useful for predicting the fatality rate because $R^2 = 96.4\%$ and $s_e = 0.163$. The large $R^2$ combined with a relatively small $s_e$ supports the statement that the model is useful for prediction.

(e) The predicted fatality rate is $\hat{y} = 2.360 - 0.08182(78) + 0.001009(78)^2 = 2.12\%$.
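The prediction in Part (e) can be checked by evaluating the quadratic model from Part (c) at a driver age of 78. The coefficients are taken as given in the solution; the function name is illustrative.

```python
def predict_fatality_rate(driver_age):
    """Predicted fatality rate from the quadratic model in Part (c)."""
    return 2.360 - 0.08182 * driver_age + 0.001009 * driver_age ** 2

print(round(predict_fatality_rate(78), 2))
```

The linear and squared terms are each larger than 6 in size at this age and nearly cancel, so the prediction is sensitive to rounding in the coefficients.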
