
Methods for Describing Sets of Data Chapter 2

2.2 a. To find the frequency for each class, count the number of times each letter occurs. The frequencies for the three classes are:

Class    Frequency
X        8
Y        9
Z        3
Total    20

b. The relative frequency for each class is found by dividing the frequency by the total sample size. The relative frequency for class X is 8/20 = .40. The relative frequency for class Y is 9/20 = .45. The relative frequency for class Z is 3/20 = .15.

Class    Frequency    Relative Frequency
X        8            .40
Y        9            .45
Z        3            .15
Total    20           1.00

c. The frequency bar chart is:

d. The pie chart for the frequency distribution is:
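The tally in parts a and b can be reproduced in a few lines of Python. This is only a sketch: the raw list of 20 letters is not reprinted in the solution, so the sample below is a hypothetical arrangement chosen to match the counts above.

from collections import Counter

# Hypothetical sample of 20 letters matching the frequencies in Exercise 2.2
data = ["X"] * 8 + ["Y"] * 9 + ["Z"] * 3

freq = Counter(data)              # class frequencies
n = sum(freq.values())            # total sample size, 20
for cls in sorted(freq):
    # class, frequency, relative frequency (frequency / n)
    print(cls, freq[cls], freq[cls] / n)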


2.4 a. The variable summarized in the table is ‘Reason for requesting the installation of the passenger-side on-off switch.’ The values this variable could assume are: Infant, Child, Medical, Infant & Medical, Child & Medical, Infant & Child, and Infant & Child & Medical. Since the responses name something, the variable is qualitative.

b. The relative frequencies are found by dividing the number of requests for each category by the total number of requests. For the category ‘Infant’, the relative frequency is 1,852/30,337 = .061. The rest of the relative frequencies are found in the table below:

Reason                       Number of Requests    Relative Frequency
Infant                       1,852                 1,852/30,337 = .061
Child                        17,148                17,148/30,337 = .565
Medical                      8,377                 8,377/30,337 = .276
Infant & Medical             44                    44/30,337 = .0014
Child & Medical              903                   903/30,337 = .030
Infant & Child               1,878                 1,878/30,337 = .062
Infant & Child & Medical     135                   135/30,337 = .0045
TOTAL                        30,337                .9999

c. Using MINITAB, a pie chart of the data is:

[Pie Chart of Reason: Child (17,148, 56.5%), Medical (8,377, 27.6%), Infant & Child (1,878, 6.2%), Infant (1,852, 6.1%), Child & Medical (903, 3.0%), Infant & Child & Medical (135, 0.4%), Infant & Medical (44, 0.1%)]

d. There are 4 categories where Medical is mentioned as a reason: Medical, Infant & Medical, Child & Medical, and Infant & Child & Medical. The sum of the frequencies for these 4 categories is 8,377 + 44 + 903 + 135 = 9,459. The proportion listing Medical as one of the reasons is 9,459/30,337 = .312.


2.6 a. To find relative frequencies, we divide the frequency of each category by the total number of incidents. The relative frequencies of the number of incidents for each of the cause categories are:

Management System Cause Category    Number of Incidents    Relative Frequency
Engineering & Design                27                     27/83 = .325
Procedures & Practices              24                     24/83 = .289
Management & Oversight              22                     22/83 = .265
Training & Communication            10                     10/83 = .120
TOTAL                               83                     1

b. The Pareto diagram is:

[Pareto diagram: Percent vs. Management System Cause Category, bars in decreasing order Eng&Des, Proc&Pract, Mgmt&Over, Trn&Comm]

c. The category with the highest relative frequency of incidents is Engineering and Design. The category with the lowest relative frequency of incidents is Training and Communication.
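A Pareto ordering is just a sort of the categories by frequency, largest first. A minimal Python sketch using the incident counts from part a:

incidents = {
    "Engineering & Design": 27,
    "Procedures & Practices": 24,
    "Management & Oversight": 22,
    "Training & Communication": 10,
}
total = sum(incidents.values())   # 83

# Pareto ordering: categories sorted by frequency, largest first
for cause, count in sorted(incidents.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cause:26s} {count:3d} {count / total:.3f}")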

2.8 a. The data collection method was a survey.

b. Since the data were numbers (percentage of US labor and materials), the variable is quantitative. Once the data were collected, they were grouped into 4 categories.


c. Using MINITAB, a pie chart of the data is:

[Pie Chart of Made in USA: 100% (64, 60.4%), 75-99% (20, 18.9%), 50-74% (18, 17.0%), <50% (4, 3.8%)]

About 60% of those surveyed believe that “Made in USA” means 100% US labor and materials.

2.10 Using MINITAB, a bar chart of the frequency of occurrence of the industry types is:

[Chart of INDUSTRY: frequency bar chart of Count by INDUSTRY, with categories Aerospace/Defense, Banking, Capital Goods, Chemicals, Conglomerates, Construction, Consumer Durables, Diversified Financials, Drugs/Biotechnology, Food Markets, Food/Drink/Tobacco, Health Care, Hotels/Restaurants/Leisure, Household/Personal Products, Insurance, Materials, Media, Oil & Gas, Retailing, Semiconductors, Services/Supplies, Software & Services, Technology Equipment, Telecommunications, Transportation, and Utilities]


2.12 Using MINITAB, the side-by-side bar charts are:

[Side-by-side bar charts titled "Unauthorized Use of Computer Systems": Relative Frequency of Yes/No/Don't know responses, 1999 vs. 2006]

The relative frequency of unauthorized use of computer systems has decreased from 1999 to 2006.

2.14 a. Using MINITAB, the side-by-side graphs are:

[Chart of Exposure, Opportunity, Content, Faculty vs Stars: four frequency bar charts of the number of programs receiving each star rating (2 through 5)]

From these graphs, one can see that very few of the top 30 MBA programs got 5 stars on any criterion. In addition, about the same number of programs got 4 stars in each of the 4 criteria. The biggest difference in ratings among the 4 criteria was in the number of programs receiving 3 stars. More programs received 3 stars in Course Content than in any of the other criteria. Consequently, fewer programs received 2 stars in Course Content than in any of the other criteria.

b. Since this chart lists the rankings of only the top 30 MBA programs in the world, it is reasonable that none of these best programs would be rated as 1 star on any criterion.


2.16

2.18 a. The original data set has 1 + 3 + 5 + 7 + 4 + 3 = 23 observations.

b. For the bottom row of the stem-and-leaf display: the stem is 0 and the leaves are 0, 1, 2. The numbers in the original data set are 0, 1, and 2.

c. The dot plot corresponding to all the data points is:

2.20 a. The measurement class that contains the highest proportion of respondents is “none”. Sixty-one percent of the respondents said that their companies did not outsource any computer security functions.

b. From the graph, 6% of the respondents indicated that they outsourced between 20% and 40% of their computer security functions.

c. The proportion of the 609 respondents who outsourced at least 40% of computer security functions is .04 + .01 + .01 = .06.

d. The number of the 609 respondents who outsourced less than 20% of computer security functions is (.27 + .61)(609) = .88(609) ≈ 536.


2.22 a. Using MINITAB, the stem-and-leaf display of the data is:

Stem-and-Leaf Display: SCORE
Stem-and-leaf of SCORE   N = 169
Leaf Unit = 1.0

   1    6  2
   1    6
   2    7  2
   3    7  8
   4    8  4
  15    8  66677888899
  56    9  00001111111222222222233333333344444444444
(100)   9  55555555555555555555556666666666666666666777777777777777777888888+
  13   10  0000000000000

b. From the stem-and-leaf display, we see that there are only 4 observations with sanitation scores less than the acceptable score of 86. The proportion of ships that have an accepted sanitation standard would be (169 – 4) / 169 = .976.

c. The sanitation score of 84 appears as the leaf 4 on the first stem of 8 in the stem-and-leaf display in part a.
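A stem-and-leaf display with Leaf Unit = 1.0 is easy to build by hand: the stem is the tens digit and the leaf is the units digit. A minimal Python sketch, shown on a small hypothetical subset of sanitation scores (the full 169-score data set is not reprinted here):

from collections import defaultdict

scores = [62, 72, 78, 84, 86, 90, 95, 100]   # hypothetical subset

stems = defaultdict(list)
for x in sorted(scores):
    stems[x // 10].append(x % 10)   # stem = tens digit, leaf = units digit

for stem in sorted(stems):
    print(f"{stem:3d} | {''.join(str(leaf) for leaf in stems[stem])}")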

2.24 a. Using MINITAB, the frequency histogram is:

[Frequency histogram of Length: Frequency vs. Length (20 to 50)]


b. Using MINITAB, the frequency histogram is:

[Frequency histogram of Weight: Frequency vs. Weight (0 to 2500)]

c. Using MINITAB, the frequency histogram is:

[Frequency histogram of DDT: Frequency vs. DDT (0 to 1000)]

2.26 Using MINITAB, the two dot plots are:

[Dotplot for Arrive-Depart]

Yes. Most of the numbers of items arriving at the work center per hour are in the 135 to 165 area. Most of the numbers of items departing the work center per hour are in the 110 to 140 area. Because the number of items arriving is larger than the number of items departing, there will probably be some sort of bottleneck.


2.28 a. Using MINITAB, the three frequency histograms are as follows (the same starting point and class interval were used for each):

Histogram of C1   N = 25   Tenth Performance
Midpoint   Count
  4.00      0
  8.00      0
 12.00      1   *
 16.00      5   *****
 20.00     10   **********
 24.00      6   ******
 28.00      0
 32.00      2   **
 36.00      0
 40.00      1   *

Histogram of C2   N = 25   Thirtieth Performance
Midpoint   Count
  4.00      1   *
  8.00      9   *********
 12.00     12   ************
 16.00      2   **
 20.00      1   *

Histogram of C3   N = 25   Fiftieth Performance
Midpoint   Count
  4.00      3   ***
  8.00     15   ***************
 12.00      4   ****
 16.00      2   **
 20.00      1   *

b. The histogram for the tenth performance shows a much greater spread of the observations than the other two histograms. The thirtieth performance histogram shows a shift to the left, implying shorter completion times than for the tenth performance. In addition, the fiftieth performance histogram shows an additional shift to the left compared to that for the thirtieth performance. However, the last shift is not as great as the first shift. This agrees with statements made in the problem.


2.30 a. A stem-and-leaf display is as follows, where the stems are the units place and the leaves are the decimal places:

Stem   Leaves
  1    0 0 0 0 1 1 2 2 2 2 2 3 4 4 4 4 4 4 4 5 5 5 5 6 7 9
  2    1 1 4 4 6 7 9 9
  3    0 0 2 8 9 9
  4    1 1 1 2 5
  5    2 4
  6
  7    8
  8
  9
 10    1

b. A little more than half (26/49 = .53) of all companies spent less than 2 months in bankruptcy. Only two of the 49 companies spent more than 6 months in bankruptcy. It appears then, in general, the length of time in bankruptcy for firms using "prepacks" is less than that of firms not using "prepacks."

c. A dot diagram will be used to compare the time in bankruptcy for the three types of "prepack" firms:

d. The circled times in part a correspond to companies that were reorganized through a leverage buyout. There does not appear to be any pattern to these points. They appear to be scattered about evenly throughout the distribution of all times.

2.32 Using MINITAB, the stem-and-leaf display for the data is:

Stem-and-leaf of Time   N = 25
Leaf Unit = 1.0

  3    3  239
  7    4  3499
 (7)   5  0011469
 11    6  34458
  6    7  13
  4    8  26
  2    9  5
  1   10  2

The numbers in bold represent delivery times associated with customers who subsequently did not place additional orders with the firm. Since there were only 2 customers with delivery times of 68 days or longer that placed additional orders, I would say the maximum tolerable delivery time is about 65 to 67 days. Everyone with delivery times less than 67 days placed additional orders.


2.34 a. $\sum x = 3 + 8 + 4 + 5 + 3 + 4 + 6 = 33$

b. $\sum x^2 = 3^2 + 8^2 + 4^2 + 5^2 + 3^2 + 4^2 + 6^2 = 175$

c. $\sum (x - 5)^2 = (3 - 5)^2 + (8 - 5)^2 + (4 - 5)^2 + (5 - 5)^2 + (3 - 5)^2 + (4 - 5)^2 + (6 - 5)^2 = 20$

d. $\sum (x - 2)^2 = (3 - 2)^2 + (8 - 2)^2 + (4 - 2)^2 + (5 - 2)^2 + (3 - 2)^2 + (4 - 2)^2 + (6 - 2)^2 = 71$

e. $\left(\sum x\right)^2 = (3 + 8 + 4 + 5 + 3 + 4 + 6)^2 = 33^2 = 1089$

2.36 a. $\sum x = 6 + 0 + (-2) + (-1) + 3 = 6$

b. $\sum x^2 = 6^2 + 0^2 + (-2)^2 + (-1)^2 + 3^2 = 50$

c. $\sum x^2 - \frac{\left(\sum x\right)^2}{5} = 50 - \frac{6^2}{5} = 50 - 7.2 = 42.8$

2.38 a. $\bar{x} = \frac{\sum x}{n} = \frac{85}{10} = 8.5$

b. $\bar{x} = \frac{400}{16} = 25$

c. $\bar{x} = \frac{35}{45} = .78$

d. $\bar{x} = \frac{242}{18} = 13.44$
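These summations are one-liners in Python. A sketch using the data of Exercise 2.34:

x = [3, 8, 4, 5, 3, 4, 6]                 # data from Exercise 2.34

sum_x = sum(x)                            # sum x = 33
sum_x2 = sum(v**2 for v in x)             # sum x^2 = 175
sum_dev5 = sum((v - 5)**2 for v in x)     # sum (x - 5)^2 = 20
square_of_sum = sum(x)**2                 # (sum x)^2 = 1089
mean = sum_x / len(x)                     # x-bar = 33/7

print(sum_x, sum_x2, sum_dev5, square_of_sum, round(mean, 2))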

2.40 The median is the middle number once the data have been arranged in order. If n is even, there is not a single middle number. Thus, to compute the median, we take the average of the middle two numbers. If n is odd, there is a single middle number. The median is this middle number.

A data set with five measurements arranged in order is 1, 3, 5, 6, 8. The median is the middle number, which is 5.

A data set with six measurements arranged in order is 1, 3, 5, 5, 6, 8. The median is the average of the middle two numbers, which is $\frac{5 + 5}{2} = \frac{10}{2} = 5$.


2.42 a. $\bar{x} = \frac{\sum x}{n} = \frac{15}{6} = 2.5$

Median = $\frac{3 + 3}{2}$ = 3 (mean of 3rd and 4th numbers, after ordering)

Mode = 3

b. $\bar{x} = \frac{\sum x}{n} = \frac{40}{13} = 3.08$

Median = 3 (7th number, after ordering)

Mode = 3

c. $\bar{x} = \frac{\sum x}{n} = \frac{496}{10} = 49.6$

Median = $\frac{48 + 50}{2}$ = 49 (mean of 5th and 6th numbers, after ordering)

Mode = 50

2.44 a. The sample mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{529 + 355 + 301 + \cdots + 63}{26} = \frac{3757}{26} = 144.5$

The sample median is found by finding the average of the 13th and 14th observations once the data are arranged in order. The 13th and 14th observations are 100 and 105. The average of these two numbers (median) is:

median $= \frac{100 + 105}{2} = \frac{205}{2} = 102.5$

The mode is the observation appearing the most. For this data set, the mode is 70, which appears 3 times. Since the mean is larger than the median, the data are skewed to the right.

b. The sample mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{11 + 9 + 6 + \cdots + 4}{26} = \frac{136}{26} = 5.23$

The sample median is found by finding the average of the 13th and 14th observations once the data are arranged in order. The 13th and 14th observations are 5 and 5. The average of these two numbers (median) is:

median $= \frac{5 + 5}{2} = \frac{10}{2} = 5$


The mode is the observation appearing the most. For this data set, the mode is 6, which appears 6 times. Since the mean and median are about the same, the data are somewhat symmetric.
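Mean, median, and mode are all available in the Python standard library. A sketch on a small hypothetical data set (the 26 observations of Exercise 2.44 are not reprinted here):

import statistics

data = [11, 9, 6, 6, 6, 5, 5, 4]      # hypothetical

print(statistics.mean(data))          # sum / n
print(statistics.median(data))        # average of middle two values when n is even
print(statistics.mode(data))          # most frequently occurring value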

2.46 a. The sample mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1.72 + 2.50 + 2.16 + \cdots + 1.95}{20} = \frac{37.62}{20} = 1.881$

The sample average surface roughness of the 20 observations is 1.881.

b. The median is found as the average of the 10th and 11th observations, once the data have been ordered. The ordered data are:

1.06 1.09 1.19 1.26 1.27 1.40 1.51 1.72 1.95 2.03 2.05 2.13 2.13 2.16 2.24 2.31 2.41 2.50 2.57 2.64

The 10th and 11th observations are 2.03 and 2.05. The median is:

$\frac{2.03 + 2.05}{2} = \frac{4.08}{2} = 2.04$

The middle surface roughness measurement is 2.04. Half of the sample measurements were less than 2.04 and half were greater than 2.04.

c. The data are somewhat skewed to the left. Thus, the median might be a better measure of central tendency than the mean. The few small values in the data tend to make the mean smaller than the median.

2.48 a. Using MINITAB, the stem-and-leaf display is:

Stem-and-leaf of PAF   N = 17
Leaf Unit = 1.0

  6   0  000009
  8   1  25
 (2)  2  45
  7   3  13
  5   4  0
  4   5
  4   6  2
  3   7  057

b. The median is the middle number once the data are arranged in order. The data arranged in order are: 0, 0, 0, 0, 0, 9, 12, 15, 24, 25, 31, 33, 40, 62, 70, 75, 77. The middle number, or the median, is 24.

c. The mean of the data is $\bar{x} = \frac{\sum x}{n} = \frac{77 + 33 + \cdots + 75 + 31}{17} = \frac{473}{17} = 27.82$


d. The number occurring most frequently is 0. The mode is 0.

e. The mode corresponds to the smallest number. It does not seem to locate the center of the distribution. Both the mean and the median are in the middle of the stem-and-leaf display. Thus, it appears that both of them locate the center of the data.

2.50 a. The sample mean length is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{42.5 + 44.0 + 41.5 + \cdots + 36.0}{144} = \frac{6165}{144} = 42.81$

The average length of the 144 fish is 42.81 cm.

The median is the average of the middle two observations once they have been ordered. The 72nd and 73rd observations are 45 and 45. The average of these two observations is 45. Half of the fish lengths are less than 45 cm and half are longer.

The mode is 46 cm. This observation occurred 12 times.

b. The sample mean weight is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{732 + 795 + 547 + \cdots + 1433}{144} = \frac{151159}{144} = 1049.72$

The average weight of the 144 fish is 1049.72 grams.

The median is the average of the middle two observations once they have been ordered. The 72nd and 73rd observations are 989 and 1,011. The average of these two observations is:

median $= \frac{989 + 1{,}011}{2} = 1000$

Half of the fish weights are less than 1000 grams and half are heavier.

There are 2 modes, 886 and 1186. Each of these observations occurred 3 times.

c. The sample mean DDT level is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{10 + 16 + 23 + \cdots + 1.9}{144} = \frac{3507.1}{144} = 24.35$

The average DDT level of the 144 fish is 24.35 parts per million.


The median is the average of the middle two observations once they have been ordered. The 72nd and 73rd observations are 7.1 and 7.2. The average of these two observations is:

median $= \frac{7.1 + 7.2}{2} = 7.15$

Half of the fish DDT levels are less than 7.15 parts per million and half are greater.

The mode is 12. This observation occurred 8 times.

d. From the graph in Exercise 2.24a, the data are skewed to the left. This corresponds to the relationship between the mean and the median. For data skewed to the left, the mean is less than the median. For the fish lengths, the mean is 42.81 and the median is 45.

e. From the graph in Exercise 2.24b, the data are slightly skewed to the right. This corresponds to the relationship between the mean and the median. For data skewed to the right, the mean is more than the median. For the fish weights, the mean is 1049.72 and the median is 1000.

f. From the graph in Exercise 2.24c, the data are skewed to the right. This corresponds to the relationship between the mean and the median. For data skewed to the right, the mean is more than the median. For the fish DDT levels, the mean is 24.35 and the median is 7.15.

2.52 a. Due to the "elite" superstars, the salary distribution is skewed to the right. Since this implies that the median is less than the mean, the players' association would want to use the median.

b. The owners, by the logic of part a, would want to use the mean.

2.54 a. The sample mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 5 + 3 + 4 + \cdots + 3}{20} = \frac{80}{20} = 4$

The sample median is found by finding the average of the 10th and 11th observations once the data are arranged in order. The data arranged in order are:

1 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 6 7 9 13

The 10th and 11th observations are 3 and 4. The average of these two numbers (median) is:

median $= \frac{3 + 4}{2} = \frac{7}{2} = 3.5$

The mode is the observation appearing the most. For this data set, the mode is 1, which appears 5 times.


b. Eliminating the largest number, which is 13, results in the following:

The sample mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 5 + 3 + 4 + \cdots + 3}{19} = \frac{67}{19} = 3.53$

The sample median is found by finding the middle observation once the data are arranged in order. The data arranged in order are:

1 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 6 7 9

The 10th observation is 3. The median is 3.

The mode is the observation appearing the most. For this data set, the mode is 1, which appears 5 times.

By dropping the largest number, the mean is reduced from 4 to 3.53. The median is reduced from 3.5 to 3. There is no effect on the mode.

c. The data arranged in order are:

1 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 6 7 9 13

If we drop the lowest 2 and largest 2 observations, we are left with:

1 1 1 2 2 3 3 3 4 4 4 5 5 5 6 7

The sample 10% trimmed mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 1 + 1 + \cdots + 7}{16} = \frac{56}{16} = 3.5$

The advantage of the trimmed mean over the regular mean is that very large and very small numbers that could greatly affect the mean have been eliminated.
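A sketch of the 10% trimmed mean in Python, using the 20 ordered observations above (the helper name trimmed_mean is mine, not from the text):

def trimmed_mean(data, trim=0.10):
    """Drop the smallest and largest trim fraction, then average the rest."""
    ordered = sorted(data)
    k = int(len(ordered) * trim)          # observations trimmed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 7, 9, 13]
print(trimmed_mean(data))   # drops two from each end: 56/16 = 3.5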


2.56 a. $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{84 - \frac{20^2}{10}}{10 - 1} = 4.8889$, $s = \sqrt{4.8889} = 2.211$

b. $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{380 - \frac{100^2}{40}}{40 - 1} = 3.3333$, $s = \sqrt{3.3333} = 1.826$

c. $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{18 - \frac{17^2}{20}}{20 - 1} = .1868$, $s = \sqrt{.1868} = .432$

2.58 a. Range = 42 − 37 = 5

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{7935 - \frac{199^2}{5}}{5 - 1} = 3.7$, $s = \sqrt{3.7} = 1.92$

b. Range = 100 − 1 = 99

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{25{,}795 - \frac{303^2}{9}}{9 - 1} = 1{,}949.25$, $s = \sqrt{1{,}949.25} = 44.15$

c. Range = 100 − 2 = 98

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{20{,}033 - \frac{295^2}{8}}{8 - 1} = 1{,}307.84$, $s = \sqrt{1{,}307.84} = 36.16$
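The shortcut formula can be checked against the library routine. A sketch using a data set consistent with the sums in Exercise 2.58a (n = 5, sum x = 199, sum x² = 7935, range 37 to 42; the actual raw data are not reprinted, so this set is one possibility):

import statistics

x = [37, 39, 40, 41, 42]   # consistent with the summary sums of 2.58a
n = len(x)

# shortcut formula: s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)
s2 = (sum(v**2 for v in x) - sum(x)**2 / n) / (n - 1)
print(s2, statistics.variance(x))   # both give 3.7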

2.60 This is one possibility for the two data sets.

Data Set 1: 1, 1, 2, 2, 3, 3, 4, 4, 5, 5
Data Set 2: 1, 1, 1, 1, 1, 5, 5, 5, 5, 5

$\bar{x}_1 = \frac{\sum x}{n} = \frac{1 + 1 + 2 + 2 + 3 + 3 + 4 + 4 + 5 + 5}{10} = \frac{30}{10} = 3$

$\bar{x}_2 = \frac{\sum x}{n} = \frac{1 + 1 + 1 + 1 + 1 + 5 + 5 + 5 + 5 + 5}{10} = \frac{30}{10} = 3$

Therefore, the two data sets have the same mean. The variances for the two data sets are:

$s_1^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{110 - \frac{30^2}{10}}{9} = \frac{20}{9} = 2.2222$

$s_2^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{130 - \frac{30^2}{10}}{9} = \frac{40}{9} = 4.4444$


The dot diagrams for the two data sets are shown below.

2.62 a. Range = 3 − 0 = 3

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{15 - \frac{7^2}{5}}{5 - 1} = 1.3$, $s = \sqrt{1.3} = 1.1402$

b. After adding 3 to each of the data points, Range = 6 − 3 = 3

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{102 - \frac{22^2}{5}}{5 - 1} = 1.3$, $s = \sqrt{1.3} = 1.1402$

c. After subtracting 4 from each of the data points, Range = −1 − (−4) = 3

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{39 - \frac{(-13)^2}{5}}{5 - 1} = 1.3$, $s = \sqrt{1.3} = 1.1402$

d. The range, variance, and standard deviation remain the same when any number is added to or subtracted from each measurement in the data set.

2.64 a. The maximum age is 64. The minimum age is 39. The range is 64 − 39 = 25.

b. The variance is:

$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{125{,}764 - \frac{2494^2}{50}}{50 - 1} = 27.822$

c. The standard deviation is:

$s = \sqrt{s^2} = \sqrt{27.822} = 5.275$

d. Since the standard deviation of the ages of the 50 most powerful women in Europe is 10 years and is greater than that in the U.S. (5.275 years), the age data for Europe is more variable.


2.66 a. The maximum weight is 1.1 carats. The minimum weight is .18 carats. The range is 1.1 − .18 = .92 carats.

b. The variance is:

$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1} = \frac{146.19 - \frac{194.32^2}{308}}{308 - 1} = .0768$ square carats

c. The standard deviation is:

$s = \sqrt{s^2} = \sqrt{.0768} = .2772$ carats

d. The standard deviation. This gives us an idea about how spread out the data are in the same units as the original data.

2.68 a. A worker's overall time to complete the operation under study is determined by adding the subtask-time averages.

Worker A

The average for subtask 1 is: $\bar{x} = \frac{\sum x}{n} = \frac{211}{7} = 30.14$

The average for subtask 2 is: $\bar{x} = \frac{\sum x}{n} = \frac{21}{7} = 3$

Worker A's overall time is 30.14 + 3 = 33.14.

Worker B

The average for subtask 1 is: $\bar{x} = \frac{\sum x}{n} = \frac{213}{7} = 30.43$

The average for subtask 2 is: $\bar{x} = \frac{\sum x}{n} = \frac{29}{7} = 4.14$

Worker B's overall time is 30.43 + 4.14 = 34.57.

b. Worker A

$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{6455 - \frac{211^2}{7}}{7 - 1}} = \sqrt{15.8095} = 3.98$

Worker B

$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{6487 - \frac{213^2}{7}}{7 - 1}} = \sqrt{.9524} = .98$

c. The standard deviations represent the amount of variability in the time it takes the worker to complete subtask 1.


d. Worker A

$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{67 - \frac{21^2}{7}}{7 - 1}} = \sqrt{.6667} = .82$

Worker B

$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{147 - \frac{29^2}{7}}{7 - 1}} = \sqrt{4.4762} = 2.12$

e. I would choose workers similar to Worker B to perform subtask 1. Worker B has a slightly higher average time on subtask 1 (A: $\bar{x}$ = 30.14, B: $\bar{x}$ = 30.43). But Worker B has a smaller variability in the time it takes to complete subtask 1 (part b). He or she is more consistent in the time needed to complete the task.

I would choose workers similar to Worker A to perform subtask 2. Worker A has a smaller average time on subtask 2 (A: $\bar{x}$ = 3, B: $\bar{x}$ = 4.14). Worker A also has a smaller variability in the time needed to complete subtask 2 (part d).

2.70 Since no information is given about the data set, we can only use Chebyshev's Rule.

a. Nothing can be said about the percentage of measurements which will fall between $\bar{x} - s$ and $\bar{x} + s$.

b. At least 3/4 or 75% of the measurements will fall between $\bar{x} - 2s$ and $\bar{x} + 2s$.

c. At least 8/9 or 89% of the measurements will fall between $\bar{x} - 3s$ and $\bar{x} + 3s$.

2.72 a. $\bar{x} = \frac{\sum x}{n} = \frac{206}{25} = 8.24$

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{1778 - \frac{206^2}{25}}{25 - 1} = 3.357$, $s = \sqrt{s^2} = 1.83$

b.

Interval                              Number of Measurements in Interval    Percentage
$\bar{x} \pm s$, or (6.41, 10.07)     18                                    18/25 = .72 or 72%
$\bar{x} \pm 2s$, or (4.58, 11.90)    24                                    24/25 = .96 or 96%
$\bar{x} \pm 3s$, or (2.75, 13.73)    25                                    25/25 = 1 or 100%

c. The percentages in part b are in agreement with Chebyshev's Rule and agree fairly well with the percentages given by the Empirical Rule.
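The interval counts in part b are easy to automate. A sketch on a small hypothetical data set (the 25 observations of Exercise 2.72 are not reprinted here):

import statistics

data = [5, 6, 7, 7, 8, 8, 8, 9, 9, 10, 11, 12]   # hypothetical

xbar = statistics.mean(data)
s = statistics.stdev(data)
for k in (1, 2, 3):
    lo, hi = xbar - k * s, xbar + k * s
    inside = sum(lo <= v <= hi for v in data)    # count of observations in interval
    print(f"xbar +/- {k}s: ({lo:.2f}, {hi:.2f})  {inside}/{len(data)} = {inside/len(data):.0%}")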


d. Range = 12 − 5 = 7; $s \approx$ Range/4 = 7/4 = 1.75. The range approximation provides a satisfactory estimate of s = 1.83 from part a.

2.74 From Chebyshev's Theorem, we know that at least 3/4 or 75% of all observations will fall within 2 standard deviations of the mean. From Exercise 2.47, $\bar{x}$ = .631. From Exercise 2.66, s = .2772. This interval is:

$\bar{x} \pm 2s \Rightarrow .631 \pm 2(.2772) \Rightarrow .631 \pm .5544 \Rightarrow (.0766, 1.1854)$

2.76 a. From the information given, we have $\bar{x}$ = 375 and s = 25. From Chebyshev's Rule, we know that at least three-fourths of the measurements are within the interval:

$\bar{x} \pm 2s$, or (325, 425)

Thus, at most one-fourth of the measurements exceed 425. In other words, more than 425 vehicles used the intersection on at most 25% of the days.

b. According to the Empirical Rule, approximately 95% of the measurements are within the interval:

$\bar{x} \pm 2s$, or (325, 425)

This leaves approximately 5% of the measurements to lie outside the interval. Because of the symmetry of a mound-shaped distribution, approximately 2.5% of these will lie below 325, and the remaining 2.5% will lie above 425. Thus, on approximately 2.5% of the days, more than 425 vehicles used the intersection.

2.78 a. Since the sample mean (18.2) is larger than the sample median (15), it indicates that the distribution of years is skewed to the right. In addition, the maximum number of years is 50 and the minimum is 2. If the distribution were symmetric, the mean and median should be about halfway between these two numbers. Halfway between the maximum and minimum values is 26, which is much larger than either the mean or the median.

b. The standard deviation can be estimated by the range divided by either 4 or 6. For this distribution, the range is: Range = Largest − Smallest = 50 − 2 = 48. Dividing the range by 4, we get an estimate of the standard deviation of 48/4 = 12. Dividing the range by 6, we get an estimate of the standard deviation of 48/6 = 8. Thus, the standard deviation should be somewhere between 8 and 12. For this problem, the standard deviation is s = 10.64. This value falls in the estimated range of 8 to 12.


c. First, we calculate the number of standard deviations from the mean the value of 40 years is. To do this, we subtract the mean and then divide by the standard deviation.

Number of standard deviations: $z = \frac{x - \bar{x}}{s} = \frac{40 - 18.2}{10.64} = 2.05 \approx 2$

Using Chebyshev's Rule, we know that at most $1/k^2$ or $1/2^2 = 1/4$ of the data will be more than 2 standard deviations from the mean. Thus, this would indicate that at most 25% of the Generation Xers responded with 40 years or more.

Next, we calculate the number of standard deviations from the mean the value of 8 years is.

Number of standard deviations: $z = \frac{x - \bar{x}}{s} = \frac{8 - 18.2}{10.64} = -.96 \approx -1$

Using Chebyshev's Rule, we get no information about the data within 1 standard deviation of the mean. However, we know the median (15) is more than 8. By definition, 50% of the data are larger than the median. Thus, at least 50% of the Generation Xers responded with 8 years or more. No additional information can be obtained with the information given.

2.80 a. Using MINITAB, the frequency histogram for the time in bankruptcy is:

[Frequency histogram of Time in Bankruptcy: Frequency vs. Time (1 to 10)]

The Empirical Rule is not applicable because the data are not mound shaped.


b. Using MINITAB, the descriptive measures are:

Descriptive Statistics: Time in Bankrupt

Variable    N    Mean    Median   TrMean   StDev   SE Mean
Time in     49   2.549   1.700    2.333    1.828   0.261

Variable    Minimum   Maximum   Q1      Q3
Time in     1.000     10.100    1.350   3.500

From Chebyshev's Theorem, we know that at least 75% of the observations will fall within 2 standard deviations of the mean. This interval is:

$\bar{x} \pm 2s \Rightarrow 2.549 \pm 2(1.828) \Rightarrow 2.549 \pm 3.656 \Rightarrow (-1.107, 6.205)$

c. There are 47 of the 49 observations within this interval. The percentage would be (47/49)(100%) = 95.9%. This agrees with Chebyshev's Theorem (at least 75%). It also agrees with the Empirical Rule (approximately 95%).

d. From the above interval, we know that about 95% of all firms filing for prepackaged bankruptcy will be in bankruptcy between 0 and 6.2 months. Thus, we would estimate that a firm considering filing for bankruptcy will be in bankruptcy up to 6.2 months.

2.82 a. Since it is given that the distribution is mound-shaped, we can use the Empirical Rule. We know that 1.84% is 2 standard deviations below the mean. The Empirical Rule states that approximately 95% of the observations will fall within 2 standard deviations of the mean and, consequently, approximately 5% will lie outside that interval. Since a mound-shaped distribution is symmetric, approximately 2.5% of the day's production of batches will fall below 1.84%.

b. If the data are actually mound-shaped, it would be extremely unusual (less than 2.5%) to observe a batch with 1.80% zinc phosphide if the true mean is 2.0%. Thus, if we did observe 1.8%, we would conclude that the mean percent of zinc phosphide in today's production is probably less than 2.0%.

2.84 a. Since we do not have any idea of the shape of the distribution of SAT-Math score changes, we must use Chebyshev's Theorem. We know that at least 8/9 of the observations will fall within 3 standard deviations of the mean. This interval would be:

$\bar{x} \pm 3s \Rightarrow 19 \pm 3(65) \Rightarrow 19 \pm 195 \Rightarrow (-176, 214)$

Thus, for a randomly selected student, we could be pretty sure that this student's score would be anywhere from 176 points below his/her previous SAT-Math score to 214 points above it.

b. Since we do not have any idea of the shape of the distribution of SAT-Verbal score changes, we must use Chebyshev's Theorem. We know that at least 8/9 of the observations will fall within 3 standard deviations of the mean. This interval would be:

$\bar{x} \pm 3s \Rightarrow 7 \pm 3(49) \Rightarrow 7 \pm 147 \Rightarrow (-140, 154)$


Thus, for a randomly selected student, we could be pretty sure that this student's score would be anywhere from 140 points below his/her previous SAT-Verbal score to 154 points above it.

c. A change of 140 points on the SAT-Math would be a little less than 2 standard deviations from the mean. A change of 140 points on the SAT-Verbal would be a little less than 3 standard deviations from the mean. Since the 140-point change for the SAT-Math is not as big a change as the 140-point change on the SAT-Verbal, it is most likely that the score was an SAT-Math score.

2.86 a. $z = \frac{x - \bar{x}}{s} = \frac{40 - 30}{5} = 2$ (sample) 2 standard deviations above the mean.

b. $z = \frac{x - \mu}{\sigma} = \frac{90 - 89}{2} = .5$ (population) .5 standard deviations above the mean.

c. $z = \frac{x - \mu}{\sigma} = \frac{50 - 50}{5} = 0$ (population) 0 standard deviations above the mean.

d. $z = \frac{x - \bar{x}}{s} = \frac{20 - 30}{4} = -2.5$ (sample) 2.5 standard deviations below the mean.
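The same z-score arithmetic in Python (the helper name z_score is mine, not from the text):

def z_score(x, center, spread):
    """Number of standard deviations x lies from the mean."""
    return (x - center) / spread

print(z_score(40, 30, 5))   # a. sample:      2.0
print(z_score(90, 89, 2))   # b. population:  0.5
print(z_score(50, 50, 5))   # c. population:  0.0
print(z_score(20, 30, 4))   # d. sample:     -2.5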

2.88 The 50th percentile of a data set is the observation that has half of the observations less than it. Another name for the 50th percentile is the median.

2.90 Since the element 40 has a z-score of −2 and 90 has a z-score of 3,

$-2 = \frac{40 - \mu}{\sigma}$ and $3 = \frac{90 - \mu}{\sigma}$

$\Rightarrow -2\sigma = 40 - \mu$ and $3\sigma = 90 - \mu$

$\Rightarrow \mu - 2\sigma = 40$ and $\mu + 3\sigma = 90 \Rightarrow \mu = 40 + 2\sigma$

By substitution, $40 + 2\sigma + 3\sigma = 90 \Rightarrow 5\sigma = 50 \Rightarrow \sigma = 10$.

By substitution, $\mu = 40 + 2(10) = 60$.

Therefore, the population mean is 60 and the standard deviation is 10.

2.92 The percentile ranking of the age of 25 years would be 100% − 73.5% = 26.5%.
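Two known z-scores give two linear equations in μ and σ; the system in Exercise 2.90 can be solved in a couple of lines. A sketch:

# z1 = (40 - mu)/sigma = -2  and  z2 = (90 - mu)/sigma = 3
x1, z1 = 40, -2
x2, z2 = 90, 3

sigma = (x2 - x1) / (z2 - z1)   # (90 - 40)/(3 - (-2)) = 10
mu = x1 - z1 * sigma            # 40 - (-2)(10) = 60
print(mu, sigma)                # 60.0 10.0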


2.94 a. From Exercise 2.77, $\bar{x}$ = 94.91 and s = 4.83. The z-score for an observation of 78 is:

$z = \frac{x - \bar{x}}{s} = \frac{78 - 94.91}{4.83} = -3.50$

This z-score indicates that an observation of 78 is 3.5 standard deviations below the mean. Very few observations will be lower than this one.

b. The z-score for an observation of 98 is:

$z = \frac{x - \bar{x}}{s} = \frac{98 - 94.91}{4.83} = 0.63$

This z-score indicates that an observation of 98 is .63 standard deviations above the mean. This score is not an unusual observation in the data set.

This z-score indicates that an observation of 98 is .63 standard deviations above the mean. This score is not an unusual observation in the data set. 2.96 a. From the problem, μ = 2.7 and σ = .5

z = x - μσ

⇒ zσ = x − μ ⇒ x = μ + zσ

For z = 2.0, x = 2.7 + 2.0(.5) = 3.7 For z = −1.0, x = 2.7 − 1.0(.5) = 2.2 For z = .5, x = 2.7 + .5(.5) = 2.95 For z = −2.5, x = 2.7 − 2.5(.5) = 1.45 b. For z = −1.6, x = 2.7 − 1.6(.5) = 1.9 c. If we assume the distribution of GPAs is

approximately mound-shaped, we can use the Empirical Rule.

From the Empirical Rule, we know that ≈.025

or ≈2.5% of the students will have GPAs above 3.7 (with z = 2). Thus, the GPA corresponding to summa cum laude (top 2.5%) will be greater than 3.7 (z > 2).

We know that ≈.16 or 16% of the students will have GPAs above 3.2 (z = 1). Thus, the

limit on GPAs for cum laude (top 16%) will be greater than 3.2 (z > 1). We must assume the distribution is mound-shaped.
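Recovering a measurement from its z-score, x = μ + zσ, is the inverse of the z-score computation. A sketch using μ = 2.7 and σ = .5 from Exercise 2.96 (the helper name x_from_z is mine):

def x_from_z(z, mu=2.7, sigma=0.5):
    """Convert a z-score back to a raw measurement: x = mu + z*sigma."""
    return mu + z * sigma

for z in (2.0, -1.0, 0.5, -2.5, -1.6):
    print(z, x_from_z(z))   # 3.7, 2.2, 2.95, 1.45, 1.9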


2.98 a. Since the data are approximately mound-shaped, we can use the Empirical Rule. On the blue exam, the mean is 53% and the standard deviation is 15%. We know that approximately 68% of all students will score within 1 standard deviation of the mean. This interval is:

$\bar{x} \pm s \Rightarrow 53 \pm 15 \Rightarrow (38, 68)$

About 95% of all students will score within 2 standard deviations of the mean. This interval is:

$\bar{x} \pm 2s \Rightarrow 53 \pm 2(15) \Rightarrow 53 \pm 30 \Rightarrow (23, 83)$

About 99.7% of all students will score within 3 standard deviations of the mean. This interval is:

$\bar{x} \pm 3s \Rightarrow 53 \pm 3(15) \Rightarrow 53 \pm 45 \Rightarrow (8, 98)$

b. Since the data are approximately mound-shaped, we can use the Empirical Rule. On the red exam, the mean is 39% and the standard deviation is 12%. We know that approximately 68% of all students will score within 1 standard deviation of the mean. This interval is:

$\bar{x} \pm s \Rightarrow 39 \pm 12 \Rightarrow (27, 51)$

About 95% of all students will score within 2 standard deviations of the mean. This interval is:

$\bar{x} \pm 2s \Rightarrow 39 \pm 2(12) \Rightarrow 39 \pm 24 \Rightarrow (15, 63)$

About 99.7% of all students will score within 3 standard deviations of the mean. This interval is:

$\bar{x} \pm 3s \Rightarrow 39 \pm 3(12) \Rightarrow 39 \pm 36 \Rightarrow (3, 75)$

c. The student would have been more likely to have taken the red exam. For the blue exam, we know that approximately 95% of all scores will be from 23% to 83%. The observed 20% score does not fall in this range. For the red exam, we know that approximately 95% of all scores will be from 15% to 63%. The observed 20% score does fall in this range. Thus, it is more likely that the student took the red exam.

2.100 The 25th percentile, or lower quartile, is the measurement that has 25% of the measurements below it and 75% of the measurements above it. The 50th percentile, or median, is the measurement that has 50% of the measurements below it and 50% of the measurements above it. The 75th percentile, or upper quartile, is the measurement that has 75% of the measurements below it and 25% of the measurements above it.


2.102 a. The median is approximately 4.

b. QL is approximately 3 (lower quartile); QU is approximately 6 (upper quartile).

c. IQR = QU − QL ≈ 6 − 3 = 3

d. The data set is skewed to the right since the right whisker is longer than the left, there is one outlier, and there are two potential outliers.

e. 50% of the measurements are to the right of the median and 75% are to the left of the upper quartile.

f. There are two potential outliers, 12 and 13. There is one outlier, 16.

2.104 a. From the problem, $\bar{x}$ = 52.33 and s = 9.22. The highest salary is 75 (thousand).

The z-score is $z = \frac{x - \bar{x}}{s} = \frac{75 - 52.33}{9.22} = 2.46$

Therefore, the highest salary is 2.46 standard deviations above the mean.

The lowest salary is 35.0 (thousand).

The z-score is $z = \frac{x - \bar{x}}{s} = \frac{35.0 - 52.33}{9.22} = -1.88$

Therefore, the lowest salary is 1.88 standard deviations below the mean.

The mean salary offer is 52.33 (thousand).

The z-score is $z = \frac{x - \bar{x}}{s} = \frac{52.33 - 52.33}{9.22} = 0$

The z-score for the mean salary offer is 0 standard deviations from the mean.

No, the highest salary offer is not unusually high. For any distribution, at least 8/9 of the salaries should have z-scores between −3 and 3. A z-score of 2.46 would not be that unusual.


b. Using MINITAB, the box plot is:

[Box plot of salary offers]

Since no salaries are outside the inner fences, none of them are potentially faulty observations.

2.106 Using MINITAB, the side-by-side box plots are:

[Side-by-side box plots of AGE (40 to 65) by GROUP (1, 2, 3)]

From the box plots, there appears to be one outlier in the third group.

2.108 a. First, we will compute the mean and standard deviation.

The sample mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{393}{75} = 5.24$

The sample variance is:

$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1} = \frac{5943 - \frac{393^2}{75}}{75 - 1} = 52.482$


The standard deviation is: $s = \sqrt{s^2} = \sqrt{52.482} = 7.244$

Since this data set is highly skewed, we will use 2 standard deviations from the mean as the cutoff for outliers. Z-scores with values greater than 2 in absolute value are considered outliers. An observation with a z-score of 2 would have the value:

$z = \frac{x - \bar{x}}{s} \Rightarrow 2 = \frac{x - 5.24}{7.244} \Rightarrow x - 5.24 = 2(7.244) = 14.488 \Rightarrow x = 19.728$

An observation with a z-score of −2 would have the value:

$z = \frac{x - \bar{x}}{s} \Rightarrow -2 = \frac{x - 5.24}{7.244} \Rightarrow x - 5.24 = -2(7.244) = -14.488 \Rightarrow x = -9.248$

Thus any observation that is greater than 19.728 or less than −9.248 would be considered an outlier. In this data set there would be 4 outliers: 21, 21, 25, 48.

b. Deleting these 4 outliers, we will recalculate the mean, median, variance, and standard deviation. The median for the original data set is the middle number once the data have been arranged in order; it is the 38th observation, which is 3.

The new mean is:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{278}{71} = 3.92$

The new sample variance is:

$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1} = \frac{2132 - \frac{278^2}{71}}{71 - 1} = 14.907$

The new standard deviation is:

$s = \sqrt{s^2} = \sqrt{14.907} = 3.861$

The new median is the 36th observation once the data have been arranged in order and is 3.

In the original data set, the mean is 5.24, the standard deviation is 7.244, and the median is 3. In the revised data set, the mean is 3.92, the standard deviation is 3.861, and the median is 3. The mean has decreased, the standard deviation has been almost halved, but the median stays the same.
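The outlier screen and recomputation can be sketched in Python. Shown on a small hypothetical right-skewed data set (the 75 observations of Exercise 2.108 are not reprinted here):

import statistics

data = [1, 2, 2, 3, 3, 3, 4, 5, 6, 25, 48]   # hypothetical, skewed right

xbar = statistics.mean(data)
s = statistics.stdev(data)

# Flag observations more than 2 standard deviations from the mean, then
# recompute the summary statistics on what remains.
outliers = [v for v in data if abs((v - xbar) / s) > 2]
kept = [v for v in data if abs((v - xbar) / s) <= 2]

print("outliers:", outliers)
print("mean:", statistics.mean(kept), "median:", statistics.median(kept),
      "stdev:", round(statistics.stdev(kept), 3))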


2.110 For Perturbed Intrinsics but no Perturbed Projections:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1.0 + 1.3 + 3.0 + 1.5 + 1.3}{5} = \frac{8.1}{5} = 1.62$

$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1} = \frac{15.63 - \frac{8.1^2}{5}}{4} = \frac{2.508}{4} = .627$

$s = \sqrt{s^2} = \sqrt{.627} = .792$

The z-score corresponding to a value of 4.5 is:

$z = \frac{x - \bar{x}}{s} = \frac{4.5 - 1.62}{.792} = 3.63$

Since this z-score is greater than 3, we would consider this an outlier for perturbed intrinsics but no perturbed projections.

For Perturbed Projections but no Perturbed Intrinsics:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{22.9 + 21.0 + 34.4 + 29.8 + 17.7}{5} = \frac{125.8}{5} = 25.16$

$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1} = \frac{3350.1 - \frac{125.8^2}{5}}{4} = \frac{184.972}{4} = 46.243$

$s = \sqrt{s^2} = \sqrt{46.243} = 6.800$

The z-score corresponding to a value of 4.5 is:

$z = \frac{x - \bar{x}}{s} = \frac{4.5 - 25.16}{6.800} = -3.038$

Since this z-score is less than −3, we would consider this an outlier for perturbed projections but no perturbed intrinsics.

Since the z-score corresponding to 4.5 for perturbed projections but no perturbed intrinsics is smaller in absolute value than that for perturbed intrinsics but no perturbed projections, it is more likely that the type of camera perturbation is perturbed projections but no perturbed intrinsics.


2.112 Using MINITAB, a scatterplot of the data is:

[Scatterplot of Var2 vs. Var1]

2.114 Using MINITAB, the scatterplot of the data is:

[Scatterplot of Lawyers vs. Offices]

As the number of offices increases, the number of lawyers also tends to increase.

2.116 a. Using MINITAB, the scatterplot is:

[Scatterplot of 30th vs. 10th trial completion times]

It appears that as the completion time for the 10th trial increases, the completion time for the 30th trial decreases.


b. Using MINITAB, the scatterplot is:

[Scatterplot of 50th vs. 10th trial completion times]

It appears that as the completion time for the 10th trial increases, the completion time for the 50th trial increases.

c. Using MINITAB, the scatterplot is:

[Scatterplot of 50th vs. 30th trial completion times]

It appears that as the completion time for the 30th trial increases, the completion time for the 50th trial increases.


2.118 Using MINITAB, the scatterplot of the data is:

[Scatterplot of Mass vs. Time (time 0 to 60)]

There is evidence to indicate that the mass of the spill tends to diminish as time increases. As time is getting larger, the mass is decreasing.
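A scatterplot in the spirit of the MINITAB plot can be sketched with matplotlib; the points below are hypothetical, chosen only to show the decaying pattern described above.

import matplotlib.pyplot as plt

time = [0, 10, 20, 30, 40, 50, 60]            # hypothetical
mass = [6.6, 5.0, 3.8, 2.9, 2.2, 1.7, 1.3]    # hypothetical, decaying with time

plt.scatter(time, mass)
plt.xlabel("Time")
plt.ylabel("Mass")
plt.title("Scatterplot of Mass vs Time")
plt.show()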

2.120 The mean is sensitive to extreme values in a data set. Therefore, the median is preferred to the mean when a data set is skewed in one direction or the other.

2.122 a. If we assume that the data are about mound-shaped, then any observation with a z-score greater than 3 in absolute value would be considered an outlier. From Exercise 1.121, the z-score corresponding to 50 is −1, the z-score corresponding to 70 is 1, and the z-score corresponding to 80 is 2. Since none of these z-scores is greater than 3 in absolute value, none would be considered outliers.

b. From Exercise 1.121, the z-score corresponding to 50 is −2, the z-score corresponding to 70 is 2, and the z-score corresponding to 80 is 4. Since the z-score corresponding to 80 is greater than 3, 80 would be considered an outlier.

c. From Exercise 1.121, the z-score corresponding to 50 is 1, the z-score corresponding to 70 is 3, and the z-score corresponding to 80 is 4. Since the z-scores corresponding to 70 and 80 are greater than or equal to 3, 70 and 80 would be considered outliers.

d. From Exercise 1.121, the z-score corresponding to 50 is .1, the z-score corresponding to 70 is .3, and the z-score corresponding to 80 is .4. Since none of these z-scores is greater than 3 in absolute value, none would be considered outliers.


2.124 a. $\sum x = 4 + 6 + 6 + 5 + 6 + 7 = 34$

$\sum x^2 = 4^2 + 6^2 + 6^2 + 5^2 + 6^2 + 7^2 = 198$

$\bar{x} = \frac{\sum x}{n} = \frac{34}{6} = 5.67$

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{198 - \frac{34^2}{6}}{6 - 1} = \frac{5.3333}{5} = 1.0667$

$s = \sqrt{1.0667} = 1.03$

b. $\sum x = -1 + 4 + (-3) + 0 + (-3) + (-6) = -9$

$\sum x^2 = (-1)^2 + 4^2 + (-3)^2 + 0^2 + (-3)^2 + (-6)^2 = 71$

$\bar{x} = \frac{\sum x}{n} = \frac{-9}{6} = -\$1.5$

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{71 - \frac{(-9)^2}{6}}{6 - 1} = \frac{57.5}{5} = 11.5$ dollars squared

$s = \sqrt{11.5} = \$3.39$

c. $\sum x = \frac{3}{5} + \frac{4}{5} + \frac{2}{5} + \frac{1}{5} + \frac{1}{16} = 2.0625$

$\sum x^2 = \left(\frac{3}{5}\right)^2 + \left(\frac{4}{5}\right)^2 + \left(\frac{2}{5}\right)^2 + \left(\frac{1}{5}\right)^2 + \left(\frac{1}{16}\right)^2 = 1.2039$

$\bar{x} = \frac{\sum x}{n} = \frac{2.0625}{5} = .4125\%$

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{1.2039 - \frac{2.0625^2}{5}}{5 - 1} = \frac{.35315}{4} = .0883\%$ squared

$s = \sqrt{.0883} = .30\%$

d. (a) Range = 7 − 4 = 3

(b) Range = $4 − (−$6) = $10

(c) Range = $\frac{4}{5}\% - \frac{1}{16}\% = \frac{64}{80}\% - \frac{5}{80}\% = \frac{59}{80}\% = .7375\%$

2.126 σ ≈ range/4 = 20/4 = 5


2.128 Using MINITAB, a pie chart of the data is:

[Pie Chart of defect: false 90.2%, true 9.8%]

A response of ‘true’ means the software contained defective code. Thus, only 9.8% of the modules contained defective software code.

2.130 The z-score would be:

$z = \frac{x - \bar{x}}{s} = \frac{408 - 603.7}{185.4} = -1.06$

Since this value is not very big, this is not an unusual value to observe.

2.132 a. The variable of interest is opinion of book reviews. The values could be ‘would not recommend’, ‘cautious or very little recommendation’, ‘little or no preference’, ‘favorable/recommended’, and ‘outstanding/significant contribution’. Since these responses are not numerical, the variable is qualitative.

b. Most of the books (63%) received a "favorable/recommended" review. About the same percentage of books received the following reviews: "cautious or very little recommendation" (10%), "little or no preference" (9%), and "outstanding/significant contribution" (12%). Only 5% of the books received "would not recommend" reviews.

c. If the top two categories are added together, the percent recommended is 75% (actually slightly higher than 75%). This agrees with the study.

2.134 a. To display the status, we use a pie chart. From the pie chart, we see that 58% of the Beanie Babies are retired and 42% are current.


b. Using MINITAB, a histogram of the values is:

Most (40 of 50) Beanie Babies have values less than $100. Of the remaining 10, 5 have values between $100 and $300, 1 has a value between $300 and $500, 1 has a value between $500 and $700, 2 have values between $700 and $900, and 1 has a value between $1900 and $2100.

c. A plot of the value versus the age of the Beanie Baby is as follows:

From the plot, it appears that as the age increases, the value tends to increase.

2.136 a. Using MINITAB, the stem-and-leaf display is:

Stem-and-leaf of C1   N = 46
Leaf Unit = 0.10

  4    0  3444
(25)   0  55555555566666667777788889
 16    1  0000112223344
 [lower rows garbled in the source: 1 7 7 2 2 2 2 2 3 2 3 9 1 4 1 4 7]


b. The leaves that represent those brands that carry the American Dental Association seal are circled above.

c. It appears that the brands approved by the ADA tend to have lower costs. Thirteen of the twenty brands approved by the ADA, or (13/20) × 100% = 65%, are less than the median cost.

2.138 a. Using MINITAB, the summary statistics are:

Descriptive Statistics: Marketing, Engineering, Accounting, Total

Variable    N    Mean     Median   TrMean   StDev   SE Mean
Marketin    50   4.766    5.400    4.732    2.584   0.365
Engineer    50   5.044    4.500    4.798    3.835   0.542
Accounti    50   3.652    0.800    2.548    6.256   0.885
Total       50   13.462   13.750   13.043   6.820   0.965

Variable    Minimum   Maximum   Q1      Q3
Marketin    0.100     11.000    2.825   6.250
Engineer    0.400     14.400    1.775   7.225
Accounti    0.100     30.000    0.200   3.725
Total       1.800     36.200    8.075   16.600

b. The z-scores corresponding to the maximum time guidelines developed for each department and the total are as follows:

Marketing: $z = \frac{x - \bar{x}}{s} = \frac{6.5 - 4.77}{2.58} = .67$

Engineering: $z = \frac{x - \bar{x}}{s} = \frac{7.0 - 5.04}{3.84} = .51$

Accounting: $z = \frac{x - \bar{x}}{s} = \frac{8.5 - 3.65}{6.26} = .77$

Total: $z = \frac{x - \bar{x}}{s} = \frac{17 - 13.46}{6.82} = .52$

c. To find the maximum processing time corresponding to a z-score of 3, we substitute the values of z, $\bar{x}$, and s into the z formula and solve for x:

$z = \frac{x - \bar{x}}{s} \Rightarrow x - \bar{x} = zs \Rightarrow x = \bar{x} + zs$

Marketing: x = 4.77 + 3(2.58) = 4.77 + 7.74 = 12.51. None of the orders exceed this time.

Engineering: x = 5.04 + 3(3.84) = 5.04 + 11.52 = 16.56. None of the orders exceed this time.

These both agree with both the Empirical Rule and Chebyshev's Rule.


Accounting: x = 3.65 + 3(6.26) = 3.65 + 18.78 = 22.43. One of the orders exceeds this time, or 1/50 = .02.

Total: x = 13.46 + 3(6.82) = 13.46 + 20.46 = 33.92. One of the orders exceeds this time, or 1/50 = .02.

These both agree with Chebyshev's Rule but not the Empirical Rule. Both of these last two distributions are skewed to the right.

d. Marketing: x = 4.77 + 2(2.58) = 4.77 + 5.16 = 9.93. Two of the orders exceed this time, or 2/50 = .04.

Engineering: x = 5.04 + 2(3.84) = 5.04 + 7.68 = 12.72. Two of the orders exceed this time, or 2/50 = .04.

Accounting: x = 3.65 + 2(6.26) = 3.65 + 12.52 = 16.17. Three of the orders exceed this time, or 3/50 = .06.

Total: x = 13.46 + 2(6.82) = 13.46 + 13.64 = 27.10. Two of the orders exceed this time, or 2/50 = .04.

All of these agree with Chebyshev's Rule but not the Empirical Rule.

e. No observations exceed the guideline of 3 standard deviations for both Marketing and Engineering. One observation exceeds the guideline of 3 standard deviations for both Accounting (#23, time = 30.0 days) and Total (#23, time = 36.2 days). Therefore, only (1/10) × 100% = 10% of the "lost" quotes have times exceeding at least one of the 3 standard deviation guidelines.

Two observations exceed the guideline of 2 standard deviations for both Marketing (#31, time = 11.0 days and #48, time = 10.0 days) and Engineering (#4, time = 13.0 days and #49, time = 14.4 days). Three observations exceed the guideline of 2 standard deviations for Accounting (#20, time = 22.0 days; #23, time = 30.0 days; and #36, time = 18.2 days). Two observations exceed the guideline of 2 standard deviations for Total (#20, time = 30.2 days and #23, time = 36.2 days). Therefore, (7/10) × 100% = 70% of the "lost" quotes have times exceeding at least one of the 2 standard deviation guidelines.

We would recommend the 2 standard deviation guideline since it covers 70% of the lost quotes, while having very few other quotes exceed the guidelines.

2.140 a. First, construct a relative frequency distribution for the departments.

Class   Department       Frequency   Relative Frequency
1       Production       13          .241
2       Maintenance      31          .574
3       Sales            3           .056
4       R & D            2           .037
5       Administration   5           .093
        TOTAL            54          1.001


The Pareto diagram is:

From the diagram, it is evident that the departments with the worst safety records are Maintenance and Production.

b. First, construct a relative frequency distribution for the type of injury in the Maintenance department.

Class   Injury         Frequency   Relative Frequency
  1     Burn               6             .194
  2     Back strain        5             .161
  3     Eye damage         2             .065
  4     Cuts              10             .323
  5     Broken arm         2             .065
  6     Broken leg         1             .032
  7     Concussion         3             .097
  8     Hearing loss       2             .065
        TOTAL             31            1.002

The Pareto diagram is:

From the Pareto diagram, it is evident that cuts are the most prevalent type of injury. Burns and back strains are the next most prevalent types of injuries.
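The tabulation behind a Pareto diagram is just a relative frequency distribution sorted in descending order. A minimal Python sketch using the injury counts above (a plotting library would be needed to draw the bars themselves):

# Relative frequencies for the Maintenance injuries, in Pareto (descending) order.
injuries = {"Burn": 6, "Back strain": 5, "Eye damage": 2, "Cuts": 10,
            "Broken arm": 2, "Broken leg": 1, "Concussion": 3, "Hearing loss": 2}
total = sum(injuries.values())
for injury, count in sorted(injuries.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{injury:<12} {count:>3} {count / total:.3f}")   # Cuts first, at .323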

2.142 a. Using MINITAB, the descriptive statistics are:

Descriptive Statistics: MPG

Variable     N      Mean    Median    TrMean     StDev   SE Mean
MPG         36    40.056    40.000    40.063     2.177     0.363

Variable   Minimum   Maximum        Q1        Q3
MPG         35.000    45.000    39.000    41.000


The mean is 40.056 and the standard deviation is 2.177. Both measures are expressed in the same units as the original data, miles per gallon.
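The same summary measures can be reproduced with Python's standard library. A sketch, where mpg_data is a hypothetical list holding the 36 observations from the exercise's data set:

import statistics

def summarize(mpg_data):
    """Key MINITAB-style summary measures for a sample."""
    n = len(mpg_data)
    s = statistics.stdev(mpg_data)        # sample standard deviation
    return {"N": n,
            "Mean": statistics.mean(mpg_data),
            "Median": statistics.median(mpg_data),
            "StDev": s,
            "SE Mean": s / n ** 0.5}      # standard error of the mean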

b. Since the sample mean is a good estimate of the population mean, the manufacturer should be satisfied. The sample mean is 40.056, which is greater than 40.

c. The range of the data set is 45 − 35 = 10. Using Chebyshev's Rule, the range should cover approximately 6 standard deviations, so a good estimate of the standard deviation would be 10/6 = 1.67. Using the Empirical Rule, the range should cover approximately 4 standard deviations, so a good estimate of the standard deviation would be 10/4 = 2.5. The given standard deviation is 2.177, which falls between these two estimates; thus, it is a reasonable value.
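The same check in a few lines of Python:

# Range-based estimates of the standard deviation.
r = 45 - 35                  # range of the MPG data
chebyshev_est = r / 6        # range covers about 6 std devs: 1.67
empirical_est = r / 4        # range covers about 4 std devs: 2.50
print(chebyshev_est, empirical_est)   # the reported 2.177 lies between the two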

d. Using MINITAB, the frequency histogram is (the relative frequency histogram would have

the same shape):

[Frequency histogram of MPG: MPG on the horizontal axis (35 to 45), Frequency on the vertical axis (0 to 9)]

Yes, the data appear to be mound-shaped.

e. Because the data are mound-shaped, we can use the Empirical Rule. We would expect approximately 68% of the data within the interval x̄ ± s, approximately 95% of the data within the interval x̄ ± 2s, and approximately all of the data within the interval x̄ ± 3s.

f. The interval x̄ ± s is 40.056 ± 2.177, or (37.879, 42.233). Twenty-seven of the observations fall in this interval, or 27/36 = .75 or 75%. This is a little larger than the approximately 68% predicted by the Empirical Rule.

The interval x̄ ± 2s is 40.056 ± 2(2.177), or (35.702, 44.410). Thirty-four of the observations fall in this interval, or 34/36 = .94 or 94%. This is very close to the predicted 95%.

The interval x̄ ± 3s is 40.056 ± 3(2.177), or (33.525, 46.587). All thirty-six observations fall in this interval, or 36/36 = 1.00 or 100%, in agreement with the prediction that essentially all of the observations fall within 3 standard deviations.
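A sketch of the interval check in part f, again assuming the observations are in the hypothetical list mpg_data:

def within_k_sd(data, xbar, s, k):
    """Proportion of observations within k standard deviations of the mean."""
    lo, hi = xbar - k * s, xbar + k * s
    return sum(1 for x in data if lo <= x <= hi) / len(data)

# With xbar = 40.056 and s = 2.177, k = 1, 2, 3 give .75, .94, and 1.00,
# close to the Empirical Rule's approximately 68%, 95%, and 100%.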


2.144 a. Both the height and width of the bars (peanuts) change. Thus, some readers may tend to equate the area of the peanuts with the frequency for each year.

b. The frequency bar chart is:


The Kentucky Milk Case (To accompany Chapters 1–2)

There are many things that could be included in a report about the possibility of collusion. I have concentrated on the incumbency rates, bid levels and dispersion, and average winning bids. With the data available, no comparison of market share can be made, since there was so much missing data. In fact, the exact analysis cannot be made with the data available, since only the winning bid information is provided; we have no idea what the losing bids were. I will present what I think is a reasonable solution. This is by no means the only solution to the case; many other presentations could also be used.

Incumbency Rates

The incumbency rate is the percent of the school districts that are won by the same vendor who won the previous year. A table containing the incumbency rates is included, as well as a plot. Notice in the plot that the incumbency rates in the Tri-county market are higher than those in the Surrounding market. From 1985 through 1988, the incumbency rate for the Tri-county market was never lower than .923, while in the same period the incumbency rate in the Surrounding market was never higher than .730. This implies the possibility of collusion in the Tri-county market.

                Surrounding Market                 Tri-county Market
        Number of   Same      Incumbency    Number of   Same      Incumbency
Year    Districts   Vendors   Rate          Districts   Vendors   Rate
1984       26         16        .615           10          8         .800
1985       27         19        .704           12         12        1.000
1986       32         19        .594           13         13        1.000
1987       37         27        .730           13         12         .923
1988       37         25        .676           13         13        1.000
1989       37         23        .622           13          9         .692
1990       34         24        .706           13         10         .769
1991        5          3        .600           13         11         .846
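A sketch of the incumbency rate calculation, assuming the winning vendors are stored in a dictionary keyed by (year, district). The structure is hypothetical, and districts with no prior-year record are simply counted as non-incumbent:

def incumbency_rate(winners, year):
    """Share of districts bid on in `year` won by the prior year's vendor."""
    districts = [d for (y, d) in winners if y == year]
    same = sum(1 for d in districts
               if winners.get((year - 1, d)) == winners[(year, d)])
    return same / len(districts)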


The plot of the incumbency rates is:

Bid Levels and Dispersion

Since we only have access to the winning bids in each of the school districts, we cannot make a true analysis of the bid levels and dispersions. As a compromise, I have used the winning bids of the two dairies in question, Trauth and Meyer, looking at only their winning bids in both the Tri-county market and the Surrounding market. If there were no collusion, the winning bids and the dispersions of the winning bids should be similar in the two markets for the two dairies. I looked at the box plots of the winning bids of the two dairies in each market for each type of milk: whole white, lowfat white, and lowfat chocolate. I have included only a few of the box plots as illustrations; those included are for 1985 and 1986.
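A sketch of the level-and-dispersion comparison, assuming the winning bids for one milk type and year are held as (market, bid) pairs like the tables below:

import statistics

def compare_markets(bids):
    """Mean and standard deviation of winning bids, by market."""
    for market in ("SUR", "TRI"):
        vals = [bid for m, bid in bids if m == market]
        print(market, round(statistics.mean(vals), 4),
              round(statistics.stdev(vals), 4))

Similar means and similar spreads across the two markets would argue against collusion; markedly higher, tighter bids in one market would not.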


1985 Winning Bids:

                             WHOLE     LOWFAT    LOWFAT
OBS   MARKET   WINNER        WHITE     WHITE     CHOCOLATE
  1   SUR      MEYER        0.1280    0.1250    0.1315
  2   SUR      TRAUTH       0.1200    0.1110    0.1090
  3   SUR      TRAUTH            .    0.1079    0.1079
  4   SUR      TRAUTH            .    0.1190    0.1210
  5   SUR      MEYER        0.1225    0.1130    0.1099
  6   SUR      TRAUTH       0.1230    0.1130    0.1120
  7   SUR      MEYER        0.1250    0.1145    0.1140
  8   TRI      TRAUTH       0.1440    0.1440         .
  9   TRI      TRAUTH       0.1450    0.1350         .
 10   TRI      MEYER        0.1410    0.1410    0.1410
 11   TRI      TRAUTH       0.1393    0.1393         .
 12   TRI      MEYER        0.1340    0.1340    0.1340
 13   TRI      MEYER        0.1445    0.1345    0.1395
 14   TRI      MEYER             .    0.1345         .
 15   TRI      TRAUTH       0.1449    0.1349    0.1399
 16   TRI      TRAUTH            .    0.1299    0.1299
 17   TRI      MEYER        0.1480    0.1480    0.1480
 18   TRI      TRAUTH       0.1310    0.1290         .
 19   TRI      MEYER             .    0.1380         .
 20   TRI      TRAUTH       0.1435    0.1335         .

Box Plots for Whole White Milk—1985

[Side-by-side boxplots of WWBID by MARKET (SURROUND, TRI-COUNTY); vertical axis 0.120 to 0.150]


Box Plots for Lowfat White Milk—1985

[Side-by-side boxplots of LFWBID by MARKET (SURROUND, TRI-COUNTY); vertical axis 0.11 to 0.15]

Box Plots for Lowfat Chocolate Milk—1985

[Side-by-side boxplots of LFCBID by MARKET (SURROUND, TRI-COUNTY); vertical axis 0.11 to 0.15]


For each type of milk, the mean and median winning bids for the Tri-county market were higher than the corresponding winning bids in the Surrounding market. Also, the dispersion, indicated by the width of the boxes and the length of the whiskers, for the Surrounding market is larger than for the Tri-county market in most cases. This is indicative of collusion in the Tri-county market. This same pattern also existed in 1986.

1986 Winning Bids:

                             WHOLE     LOWFAT    LOWFAT
OBS   MARKET   WINNER        WHITE     WHITE     CHOCOLATE
  1   SUR      TRAUTH       0.1195    0.1100    0.1085
  2   SUR      TRAUTH       0.1330    0.1240    0.1290
  3   SUR      TRAUTH       0.1140    0.1070    0.1050
  4   SUR      MEYER        0.1350    0.1250    0.1315
  5   SUR      TRAUTH       0.1224    0.1124    0.1110
  6   SUR      TRAUTH            .    0.1110    0.1110
  7   SUR      TRAUTH            .    0.1180    0.1200
  8   SUR      TRAUTH       0.1250    0.1125    0.1115
  9   TRI      TRAUTH       0.1475    0.1475         .
 10   TRI      TRAUTH       0.1469    0.1369         .
 11   TRI      MEYER        0.1440    0.1340    0.1395
 12   TRI      TRAUTH       0.1420    0.1420         .
 13   TRI      MEYER        0.1390    0.1390    0.1390
 14   TRI      MEYER        0.1470    0.1370    0.1420
 15   TRI      MEYER             .    0.1380         .
 16   TRI      TRAUTH       0.1474    0.1374    0.1424
 17   TRI      TRAUTH            .    0.1349    0.1349
 18   TRI      MEYER        0.1505    0.1505    0.1505
 19   TRI      TRAUTH       0.1360    0.1320         .
 20   TRI      MEYER             .    0.1430         .
 21   TRI      TRAUTH       0.1460    0.1360         .

Box Plots for Whole White Milk—1986

[Side-by-side boxplots of WWBID by MARKET (SURROUND, TRI-COUNTY); vertical axis 0.11 to 0.15]


Box Plots for Lowfat White Milk—1986

[Side-by-side boxplots of LFWBID by MARKET (SURROUND, TRI-COUNTY); vertical axis 0.11 to 0.15]

Box Plots for Lowfat Chocolate Milk—1986

[Side-by-side boxplots of LFCBID by MARKET (SURROUND, TRI-COUNTY); vertical axis 0.10 to 0.15]


The same pattern that existed for 1985 and 1986 also existed in 1984, 1987, and 1988. From 1989 on, the pattern no longer existed. Thus, from the plots, it appears that the two dairies were working together from 1984 through 1988 in the Tri-county market.

I also plotted the mean winning bids for the two dairies in each of the two markets from 1983 through 1991 for each type of milk. In all three plots, the mean winning bid in 1983 was almost the same in the two markets. Then, in 1984, the mean winning bid in the Tri-county market was higher than in the Surrounding market for all three types of milk. This trend basically holds through 1988 (the one exception: the lowfat white milk mean winning bid for the Surrounding market was greater than the mean winning bid in the Tri-county market in 1988). After 1988, the mean winning bids in the two markets are almost the same. This points to collusion in the Tri-county market from 1984 through 1988.
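The trend plots summarize a mean winning bid per year and market. A sketch, where records is a hypothetical list of (year, market, bid) tuples assembled from the case data:

from collections import defaultdict
import statistics

def yearly_means(records):
    """Mean winning bid for each (year, market) pair, in sorted order."""
    groups = defaultdict(list)
    for year, market, bid in records:
        groups[(year, market)].append(bid)
    return {key: statistics.mean(vals) for key, vals in sorted(groups.items())}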


The dispersion of the winning bids, measured using the standard deviation, for each of the three types of milk was generally smaller in the Tri-county market than in the Surrounding market for the years 1985 through 1988. After 1988, this pattern no longer existed. This, too, points to collusion between the two dairies in the Tri-county market during the years 1984 through 1988.
