1

Chapter 2: Methods for Describing Sets of Data

(Pages 19-98)

Homework: 14ab, 36, 43, 45, 51, 56, 64abc, 71, 79, 85, 89, 96

2

Section 2.1: Numerical Measures of Central Tendency (center):

• Why are we interested in the central tendency of a set of measurements?

The central tendency of a set of measurements is the tendency of the data to cluster (or center) about certain numerical values. Because it is very important to both descriptive and inferential statistics, there are many numerical measures, such as the mean, median, and mode, available to estimate the central tendency of a set of measurements. One cannot say which one is the best measure of central tendency for every data set, because data sets have very different characteristics.

3

• Which one is the best numerical measure for the central tendency of a set of data?

The most popular measure of central tendency is the mean (or the arithmetic mean). We use the Greek letter $\mu$ to stand for the population mean and $\bar{x}$ to stand for the sample mean. The mode is a useful measure of central tendency if one wants to know the measurement that occurs most frequently in the data set. The median is a good measure of central tendency if there are several extremely large (or extremely small) measurements in the data, because it is less affected by them than the mean.

4

• Example 2.1 (Basic): The following data give the weekly expenditures (in dollars) on nonalcoholic beverages for 45 households randomly selected from the 1996 Diary Survey.

 6.5   9.0   9.2   7.2   4.6   9.0  10.5   2.4  10.9
10.4   5.4  12.7   5.4   0.9   7.1   1.4  12.3   8.2
 4.7   1.3   2.5  13.5  10.1  15.9   5.6  15.1   0.7
10.1  10.3   2.2   7.1   4.6   8.0   0.9   3.3   3.1
 2.2  10.6   1.3   2.7  16.5   9.8   4.9   1.6  12.7

Use the SAS output in the next three tables to find the sample size, mean, median, and mode for weekly expenditures.

5

Results for Example 2.1

Variable=EXPENSE

Moments

N          45         Sum Wgts   45
Mean       6.986667   Sum        314.4
Std Dev    4.468811   Variance   19.97027
Skewness   0.31744    Kurtosis   -0.88551
USS        3075.3     CSS        878.692
CV         63.96199   Std Mean   0.666171
T:Mean=0   10.4878    Pr>|T|     0.0001
Num ^= 0   45         Num > 0    45
M(Sign)    22.5       Pr>=|M|    0.0001
Sgn Rank   517.5      Pr>=|S|    0.0001

6

Quantiles(Def=5)

100% Max   16.5     99%   16.5
 75% Q3    10.3     95%   15.1
 50% Med   7.1      90%   12.7
 25% Q1    2.7      10%   1.3

Range      15.8
Q3-Q1      7.6
Mode       0.9

7

Extremes

Lowest    Obs      Highest   Obs
   0.7   ( 27)       12.7   ( 45)
   0.9   ( 34)       13.5   ( 22)
   0.9   ( 14)       15.1   ( 26)
   1.3   ( 39)       15.9   ( 24)
   1.3   ( 20)       16.5   ( 41)
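The values N, Mean, Med, and Mode in the SAS output for Example 2.1 can be cross-checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative. Several values in this data set are tied for the highest frequency, and the sketch assumes that SAS reports the smallest of the tied values as the mode.

```python
from statistics import mean, median, multimode

# Weekly expenditures (dollars) for the 45 households in Example 2.1.
expense = [
    6.5, 9.0, 9.2, 7.2, 4.6, 9.0, 10.5, 2.4, 10.9,
    10.4, 5.4, 12.7, 5.4, 0.9, 7.1, 1.4, 12.3, 8.2,
    4.7, 1.3, 2.5, 13.5, 10.1, 15.9, 5.6, 15.1, 0.7,
    10.1, 10.3, 2.2, 7.1, 4.6, 8.0, 0.9, 3.3, 3.1,
    2.2, 10.6, 1.3, 2.7, 16.5, 9.8, 4.9, 1.6, 12.7,
]

n = len(expense)                      # sample size
sample_mean = mean(expense)           # sum of the measurements divided by n
sample_median = median(expense)       # middle value of the sorted data
modes = sorted(multimode(expense))    # every value tied for the highest frequency

print(f"n = {n}")
print(f"mean = {sample_mean:.6f}")
print(f"median = {sample_median}")
print(f"tied modes = {modes}; smallest = {modes[0]}")
```

Because several values occur twice and none occurs more often, the mode is a weak summary of this data set; the mean and median are more informative here.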

8

Example 2.2 (Intermediate): Between 1879 and 1882, Michelson conducted an experiment to determine the velocity of light. Table 2.1 presents Michelson's determinations, in km/sec, minus 299000.

Table 2.1 Velocity of Light

 870   890   850  1000   960   830   880   880   890   910
 870   840   740   980   940   790   880   910   810   920
 810   780   900   930   960   810   880   850   810   890
 740   810  1070   650   940   880   860   870   820   860
 810   760   930   760   880   880   720   840   800   880
 940   810   850   810   800   830   720   840   770   720
 950   790   950  1000   850   800   620   850   760   840
 800   810   980  1000   860   790   860   840   740   850
 810   820   980   960   900   760   970   840   750   850
 870   850   880   960   840   800   950   840   760   780

9

Result From Example 2.2

Variable=SPEED

N          100
Mean       852.2      Sum        85220
Std Dev    78.96528   Variance   6235.515
Skewness   -0.01125   Kurtosis   0.347244
USS        73241800   CSS        617316
CV         9.26605    Std Mean   7.896528
T:Mean=0   107.9209   Pr>|T|     0.0001
Num ^= 0   100        Num > 0    100
M(Sign)    50         Pr>=|M|    0.0001
Sgn Rank   2525       Pr>=|S|    0.0001

10

Quantiles(Def=5)

100% Max   1070     99%   1035
 75% Q3    895      95%   980
 50% Med   850      90%   960
 25% Q1    805      10%   760
  0% Min   620       5%   730
                     1%   635

Range      450
Q3-Q1      90
Mode       810

11

Extremes

Lowest    Obs      Highest   Obs
   620   ( 67)        980   ( 83)
   650   ( 34)       1000   (  4)
   720   ( 60)       1000   ( 64)
   720   ( 57)       1000   ( 74)
   720   ( 47)       1070   ( 33)

12

• The data set is skewed to the right if there are several extremely large measurements (see Figure 2.2). In this case the mean is greater than the median, because the extremely large values have a stronger impact on the mean than on the median.

• The data set is skewed to the left if there are several extremely small measurements (see Figure 2.3). In this case the mean is smaller than the median, because the extremely small values also have a stronger impact on the mean.

• The data sets are well behaved if they are symmetric (see Figure 2.1). Symmetric data sets possess several good properties that will be discussed in later chapters.

13

Figure 2.1 Symmetric Distribution (Mean, Median, and Mode Overlap)

14

Figure 2.2 Skewed to the Right (Mean > Median)

15

Figure 2.3 Skewed to the Left (Mean < Median)

16

Section 2.2: Numerical Measures of Variability

• Why are we interested in numerical measures of the variability of a set of measurements?

The variability of a set of measurements is the "spread" of the data. A measure of variability is as important as a measure of central tendency, because significantly different data sets can have the same mean, median, and mode. We introduce three numerical measures of variability: the range, the variance, and the standard deviation.

17

• Why is the range sometimes not a good numerical measure of the variability of a set of data?

The range depends only on the largest and smallest measurements. Two data sets with similar ranges can therefore have very different variability, and a single extremely large (or extremely small) measurement can alter the range significantly.

18

We use the symbols $s$ and $s^2$ to stand for the sample standard deviation and the sample variance, respectively, and the Greek symbols $\sigma$ and $\sigma^2$ to stand for the population standard deviation and the population variance, respectively. Both the standard deviation and the variance are good measures of the variability of a set of measurements.
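For reference, the definitions behind these symbols can be written out. The formulas below are the standard textbook definitions and are not stated on the slide itself; the numerical check uses N = 45 and CSS = 878.692 (the corrected sum of squares, i.e. the sum of squared deviations from the mean) from the SAS output for Example 2.1.

```latex
% Sample variance and sample standard deviation (n measurements)
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}

% Population variance and population standard deviation (N measurements)
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}

% Check against the Example 2.1 output
s^2 = \frac{878.692}{45 - 1} \approx 19.9703, \qquad s \approx \sqrt{19.9703} \approx 4.4688
```

Both results agree with the Variance and Std Dev entries of that output.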

19

• Is there any set of measurements that can be completely described by the sample mean and the sample standard deviation?

Yes. If the relative frequency distribution of the data is similar to Figure 2.1, the sample mean and the sample standard deviation describe the data set completely.

20

Example 2.3 (Basic): Find the variance, the standard deviation, and the range from the SAS output in Example 2.1.

21

Example 2.4 (Intermediate):

a) Find the variance, the standard deviation, and the range from the SAS output in Example 2.2.

b) Find the variance, the standard deviation, and the range after removing the three extreme values.

c) Which measure is most affected by the deletion of the extreme values?

d) Compare the mean, the median, and the mode before and after the deletion of the outliers.

22

Result From Example 2.4 (Without Extreme Values)

Variable=SPEED

N          97
Mean       854.433    Sum        82880
Std Dev    70.31135   Variance   4943.686
Skewness   0.206141   Kurtosis   -0.57312
USS        71290000   CSS        474593.8
CV         8.229007   Std Mean   7.139036
T:Mean=0   119.6847   Pr>|T|     0.0001
Num ^= 0   97         Num > 0    97
M(Sign)    48.5       Pr>=|M|    0.0001
Sgn Rank   2376.5     Pr>=|S|    0.0001

23

Quantiles(Def=5)

100% Max   1000     99%   1000
 75% Q3    890      95%   980
 50% Med   850      90%   960
 25% Q1    810      10%   760
  0% Min   720       5%   740
                     1%   720

Range      280
Q3-Q1      80
Mode       810
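One way to approach part (c) of Example 2.4 is to express the change in each measure as a percentage of its original value. The sketch below does this in Python with summary values copied from the two SAS outputs (Example 2.2 and Example 2.4). Using relative change as the yardstick is an assumption of this sketch, not something the slides prescribe.

```python
# Summary values copied from the SAS outputs: all 100 measurements
# (Example 2.2) and the 97 measurements left after the deletion (Example 2.4).
before = {"mean": 852.2, "median": 850, "mode": 810,
          "range": 450, "variance": 6235.515, "std dev": 78.96528}
after = {"mean": 854.433, "median": 850, "mode": 810,
         "range": 280, "variance": 4943.686, "std dev": 70.31135}

# Relative change of each measure, as a percentage of its original value.
pct_change = {name: 100 * (after[name] - before[name]) / before[name]
              for name in before}

# List the measures from most affected to least affected.
for name, change in sorted(pct_change.items(),
                           key=lambda item: abs(item[1]), reverse=True):
    print(f"{name:9s} {before[name]:>10} -> {after[name]:>10}  ({change:+.1f}%)")
```

Under this yardstick the range changes the most, while the median and the mode do not change at all.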

24

Section 2.3: Interpreting the Standard Deviation

The standard deviation provides a measure of the variability of a sample. Of two samples, the one with the larger sample standard deviation has the higher variability.

The standard deviation also provides information to answer questions such as "How many measurements are within 2 standard deviations of the mean?" for any specific data set. We need to understand the following two rules in order to answer such questions.

25

Chebyshev's Rule: For any set of measurements, at least $1 - 1/k^2$ of the measurements will fall within $k$ standard deviations of the mean, for any number $k$ greater than 1.

(a) At least 3/4 of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.

(b) At least 8/9 of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.
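The bound in Chebyshev's Rule is easy to tabulate. The following Python sketch evaluates $1 - 1/k^2$ for a few values of $k$; the function name `chebyshev_lower_bound` is hypothetical and chosen only for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Smallest fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is stated only for k > 1")
    return 1 - 1 / k**2

# k = 2 and k = 3 correspond to the 3/4 and 8/9 cases of the rule.
for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the measurements")
```

The rule gives only a lower bound that holds for every data set, so the actual fraction in a particular sample is often much higher.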

26

The Empirical Rule: The empirical rule is a rule of thumb that applies only to samples or populations with frequency distributions that are mound-shaped, i.e., similar to a bell.

(a) Approximately 68% of the measurements will fall within the interval $(\bar{x} - s, \bar{x} + s)$ for a sample and $(\mu - \sigma, \mu + \sigma)$ for a population.

(b) Approximately 95% of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.

(c) Approximately 99.7% of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.

27

Example 2.5 (Basic): For any set of data, what can be said about the percentage of measurements contained in each of the following intervals?

(a) $\mu - 2\sigma$ to $\mu + 2\sigma$.

(b) $\mu - 4\sigma$ to $\mu + 4\sigma$.

(c) $\mu - 3\sigma$ to $\mu + 3\sigma$.

28

Example 2.6 (Intermediate): The mean and standard deviation of the heights of a group of one hundred NBA players are 70.25 inches and 3.25 inches, respectively.

(a) How many players in this group are taller than 76.75 inches, based on the Empirical Rule?

(b) Can we answer part (a) based on Chebyshev's Rule?

(c) What assumption is required in order to apply the Empirical Rule?

29

Section 2.4: Numerical Measures of Relative Standing

• Can you say that you did poorly on an exam if you got 70 points?

Not necessarily. You might have done poorly, or you might have done well. A score of 70 is the top score if all other students got less than 60 points on an extremely difficult exam. Your performance should be judged by your relative standing instead of the numerical score. Descriptive measures of the relationship of a measurement to the rest of the data are called measures of relative standing.

30

Example 2.7 (Basic): Based on the SAS output for Example 2.1, find the following percentiles:

(a) 10th percentile

(b) 25th percentile

(c) 50th percentile

(d) 55th percentile

(e) 90th percentile

Note:

1. The median is the 50th percentile of a quantitative data set.

2. The upper quartile is the 75th percentile and the lower quartile is the 25th percentile of a quantitative data set.

31

• Quantile: Let $q$ be any number between 0 and 1. The $q$th quantile, denoted by $Q(q)$, is a number such that a fraction $q$ of the measurements fall below it and a fraction $(1-q)$ of the measurements fall above it.

32

• Sample Z Score: Suppose $x$ is a measurement from a sample with mean $\bar{x}$ and standard deviation $s$. The sample Z score of $x$ is

$$Z = \frac{x - \bar{x}}{s}.$$

• Population Z Score: Suppose $x$ is a measurement from a population with mean $\mu$ and standard deviation $\sigma$. The population Z score of $x$ is

$$Z = \frac{x - \mu}{\sigma}.$$

33

Example 2.8: The following data give the yearly contributions (in dollars) to a local church by 35 households randomly selected from the 1996 Interview Survey.

30 50 27 25 100 300 100 75 200

76 25 15 60 240 100 130 15 200

18 10 25 50 125 200 400 500 300

34 87 24 25 140 275 250 150

(a) Find the mean and median of this set of data.

(b) Find the standard deviation and range.

(c) Compute the Z score for 200.

(d) How many measurements fall within two standard deviations of the mean?

34

Univariate Procedure

Variable=DOLLARS

N          35         Sum Wgts   35
Mean       125.1714   Sum        4381
Std Dev    120.8157   Variance   14596.44
Skewness   1.374005   Kurtosis   1.620988
USS        1044655    CSS        496279
CV         96.52021   Std Mean   20.42159
T:Mean=0   6.129369   Pr>|T|     0.0001
Num ^= 0   35         Num > 0    35
M(Sign)    17.5       Pr>=|M|    0.0001
Sgn Rank   315        Pr>=|S|    0.0001

35

Quantiles(Def=5)


36

Extremes

Lowest    Obs      Highest   Obs
    10   ( 20)        275   ( 33)
    15   ( 17)        300   (  6)
    15   ( 12)        300   ( 27)
    18   ( 19)        400   ( 25)
    24   ( 30)        500   ( 26)
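Parts (c) and (d) of Example 2.8 apply the sample Z score directly. The Python sketch below recomputes the mean and standard deviation from the raw data, so the results can be compared with the SAS output, and then counts the measurements with a Z score between -2 and 2. The helper name `z_score` is illustrative only.

```python
from statistics import mean, stdev

# Yearly contributions (dollars) for the 35 households in Example 2.8.
dollars = [
    30, 50, 27, 25, 100, 300, 100, 75, 200,
    76, 25, 15, 60, 240, 100, 130, 15, 200,
    18, 10, 25, 50, 125, 200, 400, 500, 300,
    34, 87, 24, 25, 140, 275, 250, 150,
]

x_bar = mean(dollars)
s = stdev(dollars)  # sample standard deviation (divisor n - 1)

def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the center."""
    return (x - center) / spread

z_200 = z_score(200, x_bar, s)

# Measurements within two standard deviations of the mean, i.e. |Z| <= 2.
within_2s = [x for x in dollars if abs(z_score(x, x_bar, s)) <= 2]

print(f"mean = {x_bar:.4f}, s = {s:.4f}")
print(f"Z score of 200 = {z_200:.2f}")
print(f"{len(within_2s)} of {len(dollars)} measurements lie within 2 standard deviations")
```

Only the two largest contributions (400 and 500) fall outside the interval, so 33 of the 35 measurements lie within two standard deviations of the mean. That is well above the 3/4 guaranteed by Chebyshev's Rule, even though this data set is clearly skewed to the right.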

37

Section 2.5: Graphic Methods for Describing Data (Bar Chart, Pie Chart, and Histogram)

• Why do we need to use graphic methods to describe data?

The mean and standard deviation alone cannot characterize the wide variety of distributions that data can have. It is easy to find examples in which several significantly different data sets have the same mean and standard deviation.

• Can we find several different data sets with the same mean and standard deviation?

The three data sets in Figure 2.4 all have the same mean, median, standard deviation, and variance. However, they are very different.

38

Figure 2.4 Data Sets A, B, and C (common horizontal axis from 82 to 122)

39

We will not cover bar charts, pie charts, or histograms this semester. First, bar charts and pie charts pose several perception problems, as indicated by the famous book "The Elements of Graphing Data" (William S. Cleveland, 1995). Second, we focus on quantitative data this semester, but both pie charts and bar charts are graphical tools for qualitative data. Third, there is more information encoded in a well-designed stem-and-leaf display than in a histogram.

• Box plots and stem-and-leaf displays are the graphical methods discussed in this course.

40

Section 2.6: Stem-and-Leaf Display

Figure 2.5 shows a stem-and-leaf display of the ozone data (Tukey 1977). It is a hybrid between a data table and a histogram since it shows numerical values as numerals but its profile is very much like a histogram (see Figure 2.6).

One can follow these steps to construct a stem-and-leaf display by hand.

1. Define the stem and leaf to be used.

2. Write the stems in a column, arranged from the smallest stem at the top (bottom) to the largest stem at the bottom (top).

41

3. If the leaves consist of more than one digit, drop the digits after the first digit.

4. Record the leaf for each measurement in the row corresponding to its stem.

5. Find the median and highlight the leaf corresponding to the median.

6. Count the number of leaves in the row with the median and put the count, in parentheses, in the depth column.

7. Count the number of leaves for each row from the top row to the median row and put the cumulative counts in the depth column.

8. Count the number of leaves for each row from the bottom row to the median row and put the cumulative counts in the depth column.

42

Figure 2.5 Stem-and-Leaf Display

Depth  Stem  Leaf
    3    17  034
    5    16  99
    8    15  025
   12    14  1236
   16    13  1346
   23    12  2244455
   30    11  1334899
   36    10  013338
   43     9  1244899
   59     8  0000002235667779
 (11)     7  11111122355
   55     6  0114444668889
   42     5  1222259
   35     4  023677779
   26     3  11223788888888
   12     2  3444467888
    2     1  44

43

Figure 2.6 Stem-and-Leaf Display with 90 Degree Rotation (the display of Figure 2.5 turned on its side)

44

Univariate Procedure

Variable=OZONE

N          125        Sum Wgts   125
Mean       79.288     Sum        9911
Std Dev    39.90954   Variance   1592.771
Skewness   0.510449   Kurtosis   -0.49653
USS        983327     CSS        197503.6
CV         50.3349    Std Mean   3.569618
T:Mean=0   22.2119    Pr>|T|     0.0001
Num ^= 0   125        Num > 0    125
M(Sign)    62.5       Pr>=|M|    0.0001
Sgn Rank   3937.5     Pr>=|S|    0.0001

45

Quantiles(Def=5)


46

Advantages of the stem-and-leaf display:

• Both the numerical values and the graphical shape can be seen on a stem-and-leaf display.

• It is very easy to locate an individual measurement on a stem-and-leaf display.

• You can sort a relatively small data set by hand using a stem-and-leaf display.

• You can read the median, mode, range, maximum, minimum, upper quartile, lower quartile, and interquartile range (IQR) from a stem-and-leaf display.

47

• We can assess the symmetry of a set of measurements from the stem-and-leaf display. A set of measurements is symmetric if its relative frequency distribution looks similar to Figure 2.1. The relative frequency distribution of the ozone data can be seen from the rotated stem-and-leaf display (Figure 2.6). The ozone data are skewed to the right: most observations have small values, and a long tail extends toward the large values.

48

Example 2.9: The following table contains 48 measurements of the weights of a group of male students in STA 3023 last year.

Table 2.2

123  128  130  135  140  142  145  151  155  155  155  156
156  156  160  160  163  165  165  170  170  170  170  173
174  175  175  180  182  185  185  185  185  186  190  190
191  195  195  198  200  205  206  208  215  220  220  230

a) Construct a stem-and-leaf display for the data in Table 2.2.

b) Is the data symmetric?

c) Find the mean, the median, the range, the standard deviation, the lower quartile, and the upper quartile from the SAS output.

49

Depth  Stem  Leaves
    2   120  3,8
    4   130  0,5
    7   140  0,2,5
   14   150  1,5,5,5,6,6,6
   19   160  0,0,3,5,5
  (8)   170  0,0,0,0,3,4,5,5
   21   180  0,2,5,5,5,5,6
   14   190  0,0,1,5,5,8
    8   200  0,5,6,8
    4   210  5
    3   220  0,0
    1   230  0
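The construction steps listed in Section 2.6 can be sketched in Python and applied to the 48 weights of Example 2.9. The function name `stem_and_leaf` and its conventions are assumptions of this sketch: it handles non-negative integers only, uses the last digit as the leaf, and, when the number of measurements is even, takes the row of the lower of the two middle values as the median row.

```python
# Weights of the 48 male students in Example 2.9.
weights = [
    123, 128, 130, 135, 140, 142, 145, 151, 155, 155, 155, 156,
    156, 156, 160, 160, 163, 165, 165, 170, 170, 170, 170, 173,
    174, 175, 175, 180, 182, 185, 185, 185, 185, 186, 190, 190,
    191, 195, 195, 198, 200, 205, 206, 208, 215, 220, 220, 230,
]

def stem_and_leaf(data):
    """Return (depth, stem, leaves) rows, smallest stem first."""
    xs = sorted(data)
    n = len(xs)
    stems = range(xs[0] // 10, xs[-1] // 10 + 1)   # keep empty stems too
    leaves = {stem: [] for stem in stems}
    for x in xs:
        leaves[x // 10].append(x % 10)             # stem = tens and up, leaf = last digit
    counts = [len(leaves[stem]) for stem in stems]
    # Row holding the median (for even n, the lower of the two middle values).
    median_row = stems.index(xs[(n - 1) // 2] // 10)
    rows = []
    for i, stem in enumerate(stems):
        if i < median_row:
            depth = str(sum(counts[:i + 1]))       # cumulative count from the smallest stem
        elif i > median_row:
            depth = str(sum(counts[i:]))           # cumulative count from the largest stem
        else:
            depth = f"({counts[i]})"               # number of leaves in the median row
        rows.append((depth, stem, "".join(str(leaf) for leaf in leaves[stem])))
    return rows

for depth, stem, leaf_text in stem_and_leaf(weights):
    print(f"{depth:>5} {stem * 10:>5}  {leaf_text}")
```

The printed depth column can be compared row by row with the display above; the three depth entries around the median row add up to the sample size (19 + 8 + 21 = 48).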

50

Figure 2.7 Stem-and-Leaf Display with 90 Degree Rotation (the display above turned on its side)

51

SAS Output for Example 2.9

Variable=WEIGHT

N          48         Sum Wgts   48
Mean       174.3333   Sum        8368
Std Dev    25.41932   Variance   646.1418
Skewness   0.070001   Kurtosis   -0.43366
USS        1489190    CSS        30368.67
CV         14.58087   Std Mean   3.668963
T:Mean=0   47.5157    Pr>|T|     0.0001
Num ^= 0   48         Num > 0    48
M(Sign)    24         Pr>=|M|    0.0001
Sgn Rank   588        Pr>=|S|    0.0001

52

Quantiles(Def=5)

100% Max   230      99%   230
 75% Q3    190.5    95%   220
 50% Med   173.5    90%   208
 25% Q1    156      10%   140
  0% Min   123       5%   130
                     1%   123

Range      107
Q3-Q1      34.5
Mode       170
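The heading "Quantiles(Def=5)" indicates which percentile definition SAS used. The Python sketch below implements what is assumed here to be that definition (the empirical distribution function with averaging): write n times the percentage as j + g with j an integer, take the (j+1)-th smallest value if g > 0, and average the j-th and (j+1)-th smallest values if g = 0. The function name `percentile_def5` is hypothetical, and the sketch accepts whole-number percentages only.

```python
# Weights of the 48 male students in Example 2.9.
weights = [
    123, 128, 130, 135, 140, 142, 145, 151, 155, 155, 155, 156,
    156, 156, 160, 160, 163, 165, 165, 170, 170, 170, 170, 173,
    174, 175, 175, 180, 182, 185, 185, 185, 185, 186, 190, 190,
    191, 195, 195, 198, 200, 205, 206, 208, 215, 220, 220, 230,
]

def percentile_def5(data, pct):
    """Percentile for a whole-number percentage pct between 0 and 100."""
    xs = sorted(data)
    n = len(xs)
    # n * pct / 100 = j + g; integer arithmetic avoids floating-point rounding.
    j, remainder = divmod(n * pct, 100)
    if remainder > 0:
        return xs[j]                  # g > 0: the (j+1)-th smallest value
    if j == 0:
        return xs[0]                  # 0th percentile: the minimum
    if j == n:
        return xs[-1]                 # 100th percentile: the maximum
    return (xs[j - 1] + xs[j]) / 2    # g = 0: average the j-th and (j+1)-th values

for pct in (100, 99, 95, 90, 75, 50, 25, 10, 5, 1, 0):
    print(f"{pct:>3}%  {percentile_def5(weights, pct)}")
```

For the weight data, the values produced this way can be compared entry by entry with the quantile table above. Other software may use a different percentile definition and give slightly different quartiles for the same data.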

53

Section 2.7: Box Plots

• Interquartile Range (IQR): The upper quartile minus the lower quartile.

• Step: 1.5*IQR

• Upper Inner Fence: Upper quartile plus one step.

• Lower Inner Fence: Lower quartile minus one step.

• Upper Outer Fence: Upper quartile plus two steps.

• Lower Outer Fence: Lower quartile minus two steps.

• Outside Value: Any measurement that is greater than the upper inner fence or less than the lower inner fence.
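The fence definitions above can be sketched in a few lines of Python. The function name `box_plot_fences` is hypothetical; the quartiles (Q1 = 805, Q3 = 895) and the ten extreme observations are taken from the SAS output for Example 2.2.

```python
def box_plot_fences(q1, q3):
    """Return the IQR, the step, and the inner and outer fences."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return {
        "iqr": iqr,
        "step": step,
        "inner": (q1 - step, q3 + step),          # lower and upper inner fences
        "outer": (q1 - 2 * step, q3 + 2 * step),  # lower and upper outer fences
    }

# Quartiles and the five lowest and five highest observations from Example 2.2.
fences = box_plot_fences(q1=805, q3=895)
extremes = [620, 650, 720, 720, 720, 980, 1000, 1000, 1000, 1070]

lower_inner, upper_inner = fences["inner"]
outside = [x for x in extremes if x < lower_inner or x > upper_inner]

print(fences)
print("outside values:", outside)
```

Only the extreme observations need to be checked, because any measurement between 720 and 1000 lies inside the inner fences. The three outside values found this way (620, 650, and 1070) are the same three measurements whose removal turns the Example 2.2 output into the Example 2.4 output.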

54

Elements of a Box Plot:

• A rectangle is drawn with its ends at the lower and upper quartiles (the hinges). The median of the data is shown in the box, usually by a line through the box.

• The points at a distance of 1.5*IQR from each hinge mark the inner fences of the data set. Horizontal lines (the whiskers) are drawn from each hinge to the most extreme measurement inside the inner fence.

• A second pair of fences, the outer fences, lies at a distance of 3*IQR from the hinges. One symbol (usually "*" in SAS) is used to represent measurements falling between the inner and outer fences. Another symbol (usually "0" in SAS) is used to represent measurements beyond the outer fences.

55

Interpretation of Box Plots

• The median shows the central tendency of the data.

• The length of the box (IQR) provides a measure of the variability of the middle 50% of the data.

• The individual outside values give the viewer an opportunity to consider the presence of outliers, that is, observations that seem unusually, or even implausibly, large or small. Outside values are not necessarily outliers, but any outlier will almost certainly appear as an outside value.

• The box plot allows a partial assessment of symmetry. The box plot is symmetric about its median if the data is symmetric. If one whisker is clearly longer, the data is probably skewed in the direction of the longer whisker.

56

Example 2.10:

Based on the box plot for the data in Example 2.1, answer the following:

(a) Is the data symmetric?

(b) Is there any outside value?

(c) Find the upper quartile, the median, the lower quartile, minimum value, and the maximum value.

57

Figure 2.8 Box Plot for Data in Example 2.1 (axis: Weekly Expenditure, in Dollars)

58

Example 2.11: Based on the box plot for the data in Example 2.2, answer the following:

a. Is the data symmetric?

b. Is there any outside value?

c. Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.

d. Compute the interquartile range (IQR) and the step.

59

Figure 2.9 Velocity of Light (axis: Speed of Light)

60

Example 2.12:

Based on the box plot for the data in Example 2.8, answer the following:

(a) Is the data symmetric?

(b) Is there any outside value?

(c) Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.

(d) Compute the interquartile range (IQR) and the step.

61

Figure 2.10 Box Plot for Data in Example 2.8 (axis: Yearly Contributions)

62

Quick Review:

• Mean, Median, and Mode

• Range, Standard Deviation, and Variance

• Upper Quartile, Lower Quartile, and IQR

• Chebyshev's Rule and Empirical Rule

• Z-Score

• Symmetry and Skewness

• Mound-Shaped distribution

• Box-Plot and Stem-and-Leaf Display
