Post on 26-Dec-2015
Chapter 2: Methods for Describing Sets of Data
(Page 19-98)
Homework: 14ab, 36, 43, 45, 51, 56, 64abc, 71, 79, 85, 89, 96
Section 2.1: Numerical Measures of Central Tendency (center):
• Why are we interested in the central tendency of a set of measurements?
The central tendency of a set of measurements is the tendency of the data to cluster (or center) about certain numerical values. Because central tendency is important to both descriptive and inferential statistics, several numerical measures, such as the mean, the median, and the mode, are available to estimate it. No single measure is best for every data set, because data sets have very different characteristics.
• Which one is the best numerical measure for the central tendency of a set of data?
The most popular measure of central tendency is the mean (or arithmetic mean). We use the Greek letter $\mu$ to stand for the population mean and $\bar{x}$ to stand for the sample mean. The mode is a useful measure of central tendency if one wants to know the measurement that occurs most frequently in the data set. The median is a good measure of central tendency if there are several extremely large (or extremely small) measurements in the data.
• Example 2.1 (Basic): The following data give the weekly expenditures (in dollars) on nonalcoholic beverages for 45 households randomly selected from the 1996 Diary Survey.
6.5 9.0 9.2 7.2 4.6 9.0 10.5 2.4 10.9 10.4 5.4 12.7 5.4 0.9 7.1 1.4 12.3 8.2 4.7 1.3 2.5 13.5 10.1 15.9
5.6 15.1 0.7 10.1 10.3 2.2 7.1 4.6 8.0 0.9 3.3 3.1 2.2 10.6 1.3 2.7 16.5 9.8 4.9 1.6 12.7
Use the SAS output in the next three tables to find the sample size, mean, median, and mode for weekly expenditures.
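As a cross-check on the SAS output, the four requested statistics can also be computed directly from the data. The following is a minimal sketch using Python's standard `statistics` module (Python is an assumption here; the course itself works from SAS output). Several values tie for most frequent, and the SAS listing shows the smallest of them, so the sketch takes the minimum of the tied modes.

```python
import statistics

# Weekly expenditures (dollars) for the 45 households in Example 2.1.
expense = [
    6.5, 9.0, 9.2, 7.2, 4.6, 9.0, 10.5, 2.4, 10.9, 10.4, 5.4, 12.7,
    5.4, 0.9, 7.1, 1.4, 12.3, 8.2, 4.7, 1.3, 2.5, 13.5, 10.1, 15.9,
    5.6, 15.1, 0.7, 10.1, 10.3, 2.2, 7.1, 4.6, 8.0, 0.9, 3.3, 3.1,
    2.2, 10.6, 1.3, 2.7, 16.5, 9.8, 4.9, 1.6, 12.7,
]

n = len(expense)
mean = statistics.mean(expense)
median = statistics.median(expense)
# Several values occur twice; keep the smallest tied mode, as in the SAS listing.
mode = min(statistics.multimode(expense))

print(f"N = {n}, mean = {mean:.6f}, median = {median}, mode = {mode}")
```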
Results for Example 2.1
Variable=EXPENSE

Moments
N          45          Sum Wgts   45
Mean       6.986667    Sum        314.4
Std Dev    4.468811    Variance   19.97027
Skewness   0.31744     Kurtosis   -0.88551
USS        3075.3      CSS        878.692
CV         63.96199    Std Mean   0.666171
T:Mean=0   10.4878     Pr>|T|     0.0001
Num ^= 0   45          Num > 0    45
M(Sign)    22.5        Pr>=|M|    0.0001
Sign Rank  517.5       Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   16.5        99%   16.5
 75% Q3    10.3        95%   15.1
 50% Med    7.1        90%   12.7
 25% Q1     2.7        10%    1.3

Range      15.8
Q3-Q1       7.6
Mode        0.9
Extremes
Lowest   Obs      Highest   Obs
 0.7    ( 27)      12.7    ( 45)
 0.9    ( 34)      13.5    ( 22)
 0.9    ( 14)      15.1    ( 26)
 1.3    ( 39)      15.9    ( 24)
 1.3    ( 20)      16.5    ( 41)
Example 2.2 (Intermediate): Michelson conducted an experiment between 1879 and 1882 to determine the velocity of light. Table 2.1 presents Michelson's determinations, in km/sec, minus 299000.
Table 2.1 Velocity of the Light
870 890 850 1000 960 830 880 880 890 910 870 840 740 980 940 790 880 910 810 920 810 780 900 930 960 810 880 850 810 890 740 810 1070 650 940 880 860 870 820 860 810 760 930 760 880 880 720 840 800 880 940 810 850 810 800 830 720 840 770 720 950 790 950 1000 850 800 620 850 760 840 800 810 980 1000 860 790 860 840 740 850 810 820 980 960 900 760 970 840 750 850 870 850 880 960 840 800 950 840 760 780
Result From Example 2.2
Variable=SPEED

N          100
Mean       852.2       Sum        85220
Std Dev    78.96528    Variance   6235.515
Skewness   -0.01125    Kurtosis   0.347244
USS        73241800    CSS        617316
CV         9.26605     Std Mean   7.896528
T:Mean=0   107.9209    Pr>|T|     0.0001
Num ^= 0   100         Num > 0    100
M(Sign)    50          Pr>=|M|    0.0001
Sgn Rank   2525        Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   1070        99%   1035
 75% Q3     895        95%    980
 50% Med    850        90%    960
 25% Q1     805        10%    760
  0% Min    620         5%    730
                        1%    635

Range       450
Q3-Q1        90
Mode        810
Extremes
Lowest   Obs      Highest   Obs
 620    ( 67)       980    ( 83)
 650    ( 34)      1000    (  4)
 720    ( 60)      1000    ( 64)
 720    ( 57)      1000    ( 74)
 720    ( 47)      1070    ( 33)
• A data set is skewed to the right if there are several extremely large measurements (see Figure 2.2). In this case the mean is greater than the median, because the extremely large values have a stronger impact on the mean.
• A data set is skewed to the left if there are several extremely small measurements (see Figure 2.3). In this case the mean is smaller than the median, because the extremely small values likewise have a stronger impact on the mean.
• Data sets are well behaved if they are symmetric (see Figure 2.1). Symmetric data sets possess several good properties that will be discussed in later chapters.
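The effect of extreme measurements on the mean and the median can be seen with a few numbers. The sketch below uses three small hypothetical data sets (not taken from the course data) that differ only in one extreme value.

```python
import statistics

# Hypothetical data sets, chosen only to illustrate the three bullets above.
symmetric = [2, 4, 6, 8, 10]
right_skewed = [2, 4, 6, 8, 100]   # one extremely large measurement
left_skewed = [-80, 4, 6, 8, 12]   # one extremely small measurement

summary = {
    name: (statistics.mean(data), statistics.median(data))
    for name, data in [
        ("symmetric", symmetric),
        ("right-skewed", right_skewed),
        ("left-skewed", left_skewed),
    ]
}

for name, (mean, median) in summary.items():
    print(f"{name:13s} mean = {mean:6.1f}  median = {median}")
```

The median is 6 in all three cases, while the mean follows the extreme value: it equals the median for the symmetric set, rises above it for the right-skewed set, and falls below it for the left-skewed set.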
Figure 2.1 Symmetric Distribution (mean, median, and mode overlap)
Figure 2.2 Skewed to the Right (mean > median)
Figure 2.3 Skewed to the Left (mean < median)
Section 2.2: Numerical Measures of Variability
• Why are we interested in numerical measures of the variability of a set of measurements?
The variability of a set of measurements is the "spread" of the data. A measure of variability is as important as a measure of central tendency, because significantly different data sets can have the same mean, median, and mode. We introduce three numerical measures of variability: the range, the variance, and the standard deviation.
• Why is the range sometimes not a good numerical measure of the variability of a set of data?
The variability of two sets of data can be very different even if they have a similar range, because the range depends only on the largest and smallest measurements. A single extremely large (or extremely small) measurement can alter the range significantly.
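A small numerical illustration may help. The two samples below are hypothetical; they share the same smallest and largest values, so their ranges agree, yet their sample standard deviations differ.

```python
import statistics

# Hypothetical samples with the same range but different spread.
sample_a = [0, 5, 5, 5, 10]    # most values sit at the center
sample_b = [0, 0, 5, 10, 10]   # most values sit at the ends

def data_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)

range_a, range_b = data_range(sample_a), data_range(sample_b)
sd_a = statistics.stdev(sample_a)   # sample standard deviation (divides by n - 1)
sd_b = statistics.stdev(sample_b)

print(f"A: range = {range_a}, s = {sd_a:.4f}")
print(f"B: range = {range_b}, s = {sd_b:.4f}")
```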
We use the symbols $s$ and $s^2$ to stand for the sample standard deviation and the sample variance, respectively, and the Greek symbols $\sigma$ and $\sigma^2$ to stand for the population standard deviation and the population variance, respectively. Both the standard deviation and the variance are good measures of the variability of a set of measurements.
• Is there any set of measurements that can be completely explained by the sample mean and the sample standard deviation?
Yes. A set of measurements can be explained completely by its sample mean and sample standard deviation if its relative frequency distribution is similar to Figure 2.1.
Example 2.3 (Basic): Find the variance, the standard deviation and the range from SAS output in Example 2.1.
Example 2.4 (Intermediate):
a) Find the variance, the standard deviation and the range from SAS output in Example 2.2.
b) Find the variance, the standard deviation, and the range without three extreme values.
c) Which measure is most affected by the deletion of extreme values?
d) Compare the mean, the median, and the mode before and after the deletion of outliers.
Result From Example 2.4 (Without Extreme Values)
Variable=SPEED

N          97
Mean       854.433     Sum        82880
Std Dev    70.31135    Variance   4943.686
Skewness   0.206141    Kurtosis   -0.57312
USS        71290000    CSS        474593.8
CV         8.229007    Std Mean   7.139036
T:Mean=0   119.6847    Pr>|T|     0.0001
Num ^= 0   97          Num > 0    97
M(Sign)    48.5        Pr>=|M|    0.0001
Sgn Rank   2376.5      Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   1000        99%   1000
 75% Q3     890        95%    980
 50% Med    850        90%    960
 25% Q1     810        10%    760
  0% Min    720         5%    740
                        1%    720

Range       280
Q3-Q1        80
Mode        810
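One way to compare the two SAS listings is to express each change as a percentage of the original value. The sketch below copies the range, variance, and standard deviation from the output for all 100 observations and from the output without the three extreme values; the use of percentage change as the yardstick is an assumption of this sketch, not something the slides prescribe.

```python
# Summary values copied from the SAS output for Example 2.2 (all 100
# observations) and for Example 2.4 (three extreme values removed).
before = {"range": 450, "variance": 6235.515, "std dev": 78.96528}
after = {"range": 280, "variance": 4943.686, "std dev": 70.31135}

# Relative change of each measure, in percent of its original value.
pct_change = {k: 100 * (after[k] - before[k]) / before[k] for k in before}

# The measure with the largest relative drop.
most_affected = min(pct_change, key=pct_change.get)

for name, change in pct_change.items():
    print(f"{name:9s} {before[name]:>10} -> {after[name]:>10}  ({change:+.1f}%)")
print("most affected:", most_affected)
```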
Section 2.3: Interpreting the Standard Deviation
The standard deviation measures the variability of a sample: of two samples, the one with the larger sample standard deviation has the higher variability.
The standard deviation also helps to answer questions such as "How many measurements are within 2 standard deviations of the mean?" for any specific data set. We need to understand the following two rules in order to answer such questions.
Chebyshev's Rule: For any set of measurements, at least $1 - 1/k^2$ of the measurements will fall within k standard deviations of the mean, for any number k greater than 1.
(a) At least 3/4 of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.
(b) At least 8/9 of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.
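The bound in Chebyshev's Rule is easy to tabulate. The sketch below evaluates $1 - 1/k^2$ for a few values of k; the function name `chebyshev_lower_bound` is introduced here for illustration only.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is stated for k greater than 1")
    return 1 - 1 / k**2

for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.4f} of the measurements")
```

For k = 2 and k = 3 this reproduces the fractions 3/4 and 8/9 given in parts (a) and (b).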
The Empirical Rule: The empirical rule is a rule of thumb that applies only to samples or populations with frequency distributions that are mound-shaped, i.e., similar to a bell.
(a) Approximately 68% of the measurements will fall within the interval $(\bar{x} - s, \bar{x} + s)$ for a sample and $(\mu - \sigma, \mu + \sigma)$ for a population.
(b) Approximately 95% of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.
(c) Approximately 99.7% of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.
Example 2.5 (Basic): For any set of data, what can be said about the percentage of measurements contained in each of the following intervals?
(a) $\mu - 2\sigma$ to $\mu + 2\sigma$
(b) $\mu - 4\sigma$ to $\mu + 4\sigma$
(c) $\mu - 3\sigma$ to $\mu + 3\sigma$
Example 2.6 (Intermediate): The mean and standard deviation of the heights of a group of one hundred NBA players are 70.25 inches and 3.25 inches, respectively. (a) How many players in this group are taller than 76.75 inches, based on the Empirical Rule? (b) Can we answer part (a) based on Chebyshev's Rule? (c) What assumption is required in order to apply the Empirical Rule?
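The arithmetic behind part (a) can be sketched as follows, assuming the heights are mound-shaped and symmetric so that the measurements outside two standard deviations split evenly between the two tails. That even split is an assumption of this sketch; the Empirical Rule itself only gives the 95% figure.

```python
mean_height = 70.25   # inches
sd_height = 3.25      # inches
n_players = 100
cutoff = 76.75        # inches

# How many standard deviations the cutoff lies above the mean.
z = (cutoff - mean_height) / sd_height

# Empirical Rule: about 95% of measurements lie within 2 standard deviations.
# Assuming symmetry, the remaining 5% is split evenly between the two tails.
within_2_sd = 0.95
upper_tail = (1 - within_2_sd) / 2
expected_taller = n_players * upper_tail

print(f"z = {z}, expected number taller than {cutoff} in: about {expected_taller:.1f}")
```

The cutoff sits exactly two standard deviations above the mean, so roughly 2 or 3 of the 100 players would be expected to be taller than 76.75 inches.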
Section 2.4: Numerical Measures of Relative Standing
• Can you say that you did poorly on an exam if you got 70 points?
You might have done poorly, or you might have done fairly well. You could even have the top score if all the other students got less than 60 points on an extremely difficult exam. Your performance should be judged by your relative standing rather than by the numerical score. Descriptive measures of the relationship of a measurement to the rest of the data are called measures of relative standing.
Example 2.7 (Basic): Based on the SAS output for Example 2.1, find the following percentiles:
(a) 10th percentile
(b) 25th percentile
(c) 50th percentile
(d) 55th percentile
(e) 90th percentile
Note:
1. The median is the 50th percentile of a quantitative data set.
2. The upper quartile is the 75th percentile and the lower quartile is the 25th percentile of a quantitative data set.
• Quantile: Let q be any number between 0 and 1. The qth quantile, denoted by Q(q), is a number such that a fraction q of the measurements fall below it and a fraction (1-q) of the measurements fall above it.
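For a finite sample there are several conventions for picking this number. The SAS listings in this chapter are labelled Def=5, which is understood here to be the "empirical distribution function with averaging" rule: with n measurements sorted in increasing order and n*q = j + g (j a whole number, g the fractional part), the quantile is the average of the j-th and (j+1)-th values when g = 0 and the (j+1)-th value otherwise. The sketch below implements that rule under this assumption; the function name `percentile_def5` is hypothetical, and the percentage is passed as a whole number so the arithmetic stays exact.

```python
def percentile_def5(values, pct):
    """Percentile by the empirical distribution function with averaging.

    pct is a whole-number percentage (e.g. 25 for the 25th percentile).
    """
    xs = sorted(values)
    n = len(xs)
    j, remainder = divmod(n * pct, 100)   # n * pct / 100 = j + fractional part
    if remainder == 0:
        # n*q is a whole number: average the j-th and (j+1)-th ordered values
        # (1-indexed), staying inside the data at the two ends.
        lo = xs[max(j - 1, 0)]
        hi = xs[min(j, n - 1)]
        return (lo + hi) / 2
    return xs[j]   # the (j+1)-th ordered value, 1-indexed

# Weekly expenditures (dollars) for the 45 households in Example 2.1.
expense = [
    6.5, 9.0, 9.2, 7.2, 4.6, 9.0, 10.5, 2.4, 10.9, 10.4, 5.4, 12.7,
    5.4, 0.9, 7.1, 1.4, 12.3, 8.2, 4.7, 1.3, 2.5, 13.5, 10.1, 15.9,
    5.6, 15.1, 0.7, 10.1, 10.3, 2.2, 7.1, 4.6, 8.0, 0.9, 3.3, 3.1,
    2.2, 10.6, 1.3, 2.7, 16.5, 9.8, 4.9, 1.6, 12.7,
]

for pct in (10, 25, 50, 75, 90):
    print(f"{pct}th percentile: {percentile_def5(expense, pct)}")
```

On the Example 2.1 data this reproduces the 10%, 25%, 50%, 75%, and 90% entries of the SAS quantile table, and it can also be evaluated at percentages the table does not list.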
• Sample Z Score: Suppose x is a measurement from a sample with mean $\bar{x}$ and standard deviation $s$. The sample Z score of x is $Z = (x - \bar{x})/s$.
• Population Z Score: Suppose x is a measurement from a population with mean $\mu$ and standard deviation $\sigma$. The population Z score of x is $Z = (x - \mu)/\sigma$.
Example 2.8: The following data give the yearly contributions (in dollars) to a local church by 35 households randomly selected from the 1996 Interview Survey.
30 50 27 25 100 300 100 75 200
76 25 15 60 240 100 130 15 200
18 10 25 50 125 200 400 500 300
34 87 24 25 140 275 250 150
(a) Find the mean and median of this set of data.
(b) Find the standard deviation and range.
(c) Compute the Z score for 200.
(d) How many measurements fall within two standard deviations of the mean?
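Parts (c) and (d) follow directly from the sample mean and sample standard deviation. The sketch below computes both from the 35 contributions with Python's standard `statistics` module (an assumption; the course reads these values from SAS output), then evaluates the sample Z score of 200 and counts the measurements inside the two-standard-deviation interval.

```python
import statistics

# Yearly contributions (dollars) for the 35 households in Example 2.8.
dollars = [
    30, 50, 27, 25, 100, 300, 100, 75, 200,
    76, 25, 15, 60, 240, 100, 130, 15, 200,
    18, 10, 25, 50, 125, 200, 400, 500, 300,
    34, 87, 24, 25, 140, 275, 250, 150,
]

mean = statistics.mean(dollars)
sd = statistics.stdev(dollars)        # sample standard deviation (divides by n - 1)

# Sample Z score of the measurement 200.
z_200 = (200 - mean) / sd

# Count the measurements within two standard deviations of the mean.
low, high = mean - 2 * sd, mean + 2 * sd
within_two_sd = sum(low <= x <= high for x in dollars)

print(f"mean = {mean:.4f}, s = {sd:.4f}")
print(f"Z score of 200 = {z_200:.4f}")
print(f"{within_two_sd} of {len(dollars)} measurements lie in ({low:.2f}, {high:.2f})")
```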
Univariate Procedure
Variable=DOLLARS

N          35          Sum Wgts   35
Mean       125.1714    Sum        4381
Std Dev    120.8157    Variance   14596.44
Skewness   1.374005    Kurtosis   1.620988
USS        1044655     CSS        496279
CV         96.52021    Std Mean   20.42159
T:Mean=0   6.129369    Pr>|T|     0.0001
Num ^= 0   35          Num > 0    35
M(Sign)    17.5        Pr>=|M|    0.0001
Sgn Rank   315         Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max    500        99%    500
 75% Q3     200        95%    400
 50% Med     87        90%    300
 25% Q1      25        10%     18
  0% Min     10         5%     15
                        1%     10

Range       490
Q3-Q1       175
Mode         25
Extremes
Lowest   Obs      Highest   Obs
  10    ( 20)       275    ( 33)
  15    ( 17)       300    (  6)
  15    ( 12)       300    ( 27)
  18    ( 19)       400    ( 25)
  24    ( 30)       500    ( 26)
Section 2.5: Graphic Methods for Describing Data (Bar Chart, Pie Chart, and Histogram)
• Why do we need graphical methods to describe data?
The mean and standard deviation alone cannot characterize the wide variety of distributions that data can have. It is easy to find significantly different data sets that have the same mean and standard deviation.
• Can we find several different data sets with the same mean and standard deviation?
The three data sets in Figure 2.4 all have the same mean, median, standard deviation, and variance. However, they are very different.
Figure 2.4 Dot plots of three data sets, A, B, and C, on a common axis running from 82 to 122
We will not cover bar charts, pie charts, or histograms this semester. First, bar charts and pie charts pose several perception problems, as indicated in the well-known book "The Elements of Graphing Data" (William S. Cleveland, 1995). Second, we focus on quantitative data this semester, whereas pie charts and bar charts are graphical tools for qualitative data. Third, a well-designed stem-and-leaf display encodes more information than a histogram.
• Box plots and stem-and-leaf displays are the graphical methods discussed in this course.
Section 2.6: Stem-and-Leaf Display
Figure 2.5 shows a stem-and-leaf display of the ozone data (Tukey 1977). It is a hybrid between a data table and a histogram since it shows numerical values as numerals but its profile is very much like a histogram (see Figure 2.6).
One can construct a stem-and-leaf display by hand with the following steps.
1. Define the stem and leaf to be used.
2. Write the stems in a column, arranged from the smallest stem at the top to the largest stem at the bottom (or the reverse).
3. If the leaves consist of more than one digit, drop the digits after the first digit.
4. Record the leaf for each measurement in the row corresponding to its stem.
5. Find the median and highlight the leaf corresponding to the median.
6. Count the number of leaves in the row with the median and put the count in the depth column.
7. Count the number of leaves for each row from the top row to the median row and put the cumulative counts in the depth column.
8. Count the number of leaves for each row from the bottom row to the median row and put the cumulative counts in the depth column.
Figure 2.5 Stem-and-Leaf
Depth  Stem  Leaf
    3    17  034
    5    16  99
    8    15  025
   12    14  1236
   16    13  1346
   23    12  2244455
   30    11  1334899
   36    10  013338
   43     9  1244899
   59     8  0000002235667779
 (11)     7  11111122355
   55     6  0114444668889
   42     5  1222259
   35     4  023677779
   26     3  11223788888888
   12     2  3444467888
    2     1  44
Figure 2.6 Stem-and-Leaf Display with 90 Degree Rotation (the display of Figure 2.5 turned on its side, so that the rows of leaves stand as columns)
Univariate Procedure
Variable=OZONE

N          125         Sum Wgts   125
Mean       79.288      Sum        9911
Std Dev    39.90954    Variance   1592.771
Skewness   0.510449    Kurtosis   -0.49653
USS        983327      CSS        197503.6
CV         50.3349     Std Mean   3.569618
T:Mean=0   22.2119     Pr>|T|     0.0001
Num ^= 0   125         Num > 0    125
M(Sign)    62.5        Pr>=|M|    0.0001
Sgn Rank   3937.5      Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max    174        99%    173
 75% Q3     103        95%    152
 50% Med     72        90%    136
 25% Q1      47        10%     31
  0% Min     14         5%     24
                        1%     14

Range       160
Q3-Q1        56
Mode         38
Advantages of stem-and-leaf display:
• Both the numerical values and the graphical shape can be seen on a stem-and-leaf display.
• It is very easy to locate an individual measurement on a stem-and-leaf display.
• You can sort a relatively small data set by hand using a stem-and-leaf display.
• You can read off the median, mode, range, maximum, minimum, upper quartile, lower quartile, and inner quartile range from a stem-and-leaf display.
• We can determine the symmetry information of a set of measurements from the stem-and-leaf display. A set of measurements is symmetric if its relative frequency distribution looks similar to Figure 2.1. The relative frequency distribution of Ozone data can be seen from the rotated stem-and-leaf display (Figure 2.6). Ozone data is skewed to the right because there are more observations with small values than observations with large values.
Example 2.9: The following table contains 48 measurements of the weight of a group of male students in STA 3023 last year.
Table 2.2
123 128 130 135 140 142 145 151 155 155 155 156
156 156 160 160 163 165 165 170 170 170 170 173 174 175 175 180 182 185 185 185 185 186 190 190 191 195 195 198 200 205 206 208 215 220 220 230
a) Construct a stem-and-leaf display for the data in Table 2.2.
b) Is the data symmetric?
c) Find the mean, the median, the range, the standard deviation, the lower quartile, and the upper quartile from the SAS output.
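The hand construction described in Section 2.6 can be sketched in code for the 48 weights of this example. This is an illustrative sketch, not the procedure SAS uses: it takes the tens as stems and the units digit as leaves, marks the row containing the median with its own count in parentheses, and fills the depth column with the cumulative count from the nearer end of the display. The function name `stem_and_leaf` and its `stem_unit` parameter are hypothetical.

```python
from collections import defaultdict

# The 48 weights from Example 2.9.
weights = [
    123, 128, 130, 135, 140, 142, 145, 151, 155, 155, 155, 156,
    156, 156, 160, 160, 163, 165, 165, 170, 170, 170, 170, 173,
    174, 175, 175, 180, 182, 185, 185, 185, 185, 186, 190, 190,
    191, 195, 195, 198, 200, 205, 206, 208, 215, 220, 220, 230,
]

def stem_and_leaf(values, stem_unit=10):
    """Return a list of (depth, stem, leaves) rows, smallest stem first."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, stem_unit)
        rows[stem].append(leaf)
    n = len(values)
    result = []
    above = 0  # number of leaves in the rows already processed
    for stem in range(min(rows), max(rows) + 1):
        leaves = rows[stem]
        below = n - above - len(leaves)  # leaves in the rows still to come
        if above < n / 2 and below < n / 2:
            depth = f"({len(leaves)})"   # median row: show its own count
        else:
            # cumulative count from the nearer end of the display
            depth = str(min(above, below) + len(leaves))
        result.append((depth, stem, leaves))
        above += len(leaves)
    return result

table = stem_and_leaf(weights)
print("Depth  Stem  Leaves")
for depth, stem, leaves in table:
    print(f"{depth:>5}  {stem * 10:>4}  {','.join(map(str, leaves))}")
```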
Depth  Stem  Leaves
    2   120  3,8
    4   130  0,5
    7   140  0,2,5
   14   150  1,5,5,5,6,6,6
   19   160  0,0,3,5,5
  (8)   170  0,0,0,0,3,4,5,5
   21   180  0,2,5,5,5,5,6
   14   190  0,0,1,5,5,8
    8   200  0,5,6,8
    4   210  5
    3   220  0,0
    1   230  0
Figure 2.7 Stem-and-Leaf Display with 90 Degree Rotation (the display for Example 2.9 turned on its side, so that the rows of leaves stand as columns)
SAS Output for Example 2.9
Variable=WEIGHT

N          48          Sum Wgts   48
Mean       174.3333    Sum        8368
Std Dev    25.41932    Variance   646.1418
Skewness   0.070001    Kurtosis   -0.43366
USS        1489190     CSS        30368.67
CV         14.58087    Std Mean   3.668963
T:Mean=0   47.5157     Pr>|T|     0.0001
Num ^= 0   48          Num > 0    48
M(Sign)    24          Pr>=|M|    0.0001
Sgn Rank   588         Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max    230        99%    230
 75% Q3     190.5      95%    220
 50% Med    173.5      90%    208
 25% Q1     156        10%    140
  0% Min    123         5%    130
                        1%    123

Range       107
Q3-Q1       34.5
Mode        170
Section 2.7: Box Plots
• Inner Quartile Range (IQR): The upper quartile minus the lower quartile.
• Step: 1.5*IQR
• Upper Inner Fence: Upper quartile plus one step.
• Lower Inner Fence: Lower quartile minus one step.
• Upper Outer Fence: Upper quartile plus two steps.
• Lower Outer Fence: Lower quartile minus two steps.
• Outside Value: Any measurement that is greater than the upper inner fence or less than the lower inner fence.
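The definitions above translate directly into arithmetic. The sketch below applies them to the quartiles reported by SAS for the velocity-of-light data in Example 2.2 (Q1 = 805, Q3 = 895) and checks the ten observations listed in that example's Extremes table. Checking only those ten is enough here because the third-lowest (720) and second-highest (1000) observations already lie inside the inner fences. The function name `fences` is hypothetical.

```python
def fences(q1, q3):
    """IQR, step, and inner/outer fences from the lower and upper quartiles."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return {
        "iqr": iqr,
        "step": step,
        "inner": (q1 - step, q3 + step),          # one step beyond each quartile
        "outer": (q1 - 2 * step, q3 + 2 * step),  # two steps beyond each quartile
    }

# Quartiles and the ten extreme observations from the SAS output for Example 2.2.
f = fences(q1=805, q3=895)
extremes = [620, 650, 720, 720, 720, 980, 1000, 1000, 1000, 1070]

lower_inner, upper_inner = f["inner"]
outside = [x for x in extremes if x < lower_inner or x > upper_inner]

print(f"IQR = {f['iqr']}, step = {f['step']}")
print(f"inner fences = {f['inner']}, outer fences = {f['outer']}")
print(f"outside values = {outside}")
```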
Elements of a Box Plot:
• A rectangle is drawn with its ends (the hinges) at the lower and upper quartiles. The median of the data is shown in the box, usually by a line through the box.
• The points at a distance of 1.5*IQR from each hinge mark the inner fences of the data set. Horizontal lines (the whiskers) are drawn from each hinge to the most extreme measurement inside the inner fence.
• A second pair of fences, the outer fences, lies at a distance of 3*IQR from the hinges. One symbol (usually "*" in SAS) is used to represent measurements falling between the inner and outer fences. Another symbol (usually "0" in SAS) is used to represent measurements beyond the outer fences.
Interpretation of Box Plots
• The median shows the central tendency of the data.
• The length of the box (IQR) provides a measure of the variability of the middle 50% of the data.
• The individual outside values give the viewer an opportunity to detect the presence of outliers, that is, observations that seem unusually, or even implausibly, large or small. Outside values are not necessarily outliers, but any outlier will almost certainly appear as an outside value.
• The box plot allows a partial assessment of symmetry. The box plot is symmetric about its median if the data are symmetric. If one whisker is clearly longer, the data are probably skewed in the direction of the longer whisker.
Example 2.10:
Based on the box plot for the data in Example 2.1, answer the following:
(a) Is the data symmetric?
(b) Is there any outside value?
(c) Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.
Figure 2.8 Box Plot for Data in Example 2.1 (axis: Weekly Expenditure, in dollars, with ticks at 5, 10, and 15)
Example 2.11: Based on the box plot for the data in Example 2.2, answer the following:
a. Is the data symmetric?
b. Is there any outside value?
c. Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.
d. Compute the inner quartile range and the step.
Figure 2.9 Velocity of the Light (box plot; axis: Speed of the Light, with ticks from 700 to 1000)
Example 2.12:
Based on the box plot for the data in Example 2.8, answer the following:
(a) Is the data symmetric?
(b) Is there any outside value?
(c) Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.
(d) Compute the inner quartile range and the step.
Figure 2.10 Box Plot for Data in Example 2.8 (axis: Yearly Contributions, with ticks from 0 to 500)
Quick Review:
• Mean, Median, and Mode
• Range, Standard Deviation, and Variance
• Upper Quartile, Lower Quartile, and IQR
• Chebyshev's Rule and Empirical Rule
• Z-Score
• Symmetry and Skewness
• Mound-Shaped distribution
• Box-Plot and Stem-and-Leaf Display