Chapter 2 Descriptive Statistics - Queen's...
Transcript of Chapter 2 Descriptive Statistics - Queen's...
Chapter 2
Descriptive Statistics
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-2
After completing this chapter, you should be able to:
Compute and interpret the mean, median, andmode for a set of data
Find the range, variance, standard deviation, andcoefficient of variation and know what these values mean
Apply the empirical rule to describe the variation of population values around the mean
Explain the weighted mean and when to use it
Chapter Goals
Figure 2.1:
1
2 CHAPTER 2. DESCRIPTIVE STATISTICS
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-3
Chapter Topics
Measures of central tendency, variation, and shape Mean, median, mode, geometric mean Quartiles
Range, interquartile range, variance and standard deviation, coefficient of variation
Symmetric and skewed distributions
Population summary measures Mean, variance, and standard deviation
Figure 2.2:
2.1. MEASUREOF CENTRAL TENDENCYORMEASURES OF LOCATION 3
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-5
Describing Data Numerically
Arithmetic Mean
Median
Mode
Describing Data Numerically
Variance
Standard Deviation
Coefficient of Variation
Range
Interquartile Range
Central Tendency Variation
Figure 2.3:
• We will consider the description of a data set, with and observations
for the sample and population respectively
• Often referred to as data reduction
2.1 Measure of Central Tendency or Measures of
Location
4 CHAPTER 2. DESCRIPTIVE STATISTICS
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-6
Measures of Central Tendency
Central Tendency
Mean Median Mode
n
xx
n
1ii
Overview
Midpoint of ranked values
Most frequently observed value
Arithmetic average
Figure 2.4:
2.1.1 Population Mean
=1
X=1
2.1. MEASUREOF CENTRAL TENDENCYORMEASURES OF LOCATION 5
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-7
Arithmetic Mean
The arithmetic mean (mean) is the most common measure of central tendency
For a population of N values:
For a sample of size n:
Sample sizen
xxx
n
xx n21
n
1ii
Observed
values
N
xxx
N
xμ N21
N
1ii
Population size
Population values
Figure 2.5:
6 CHAPTER 2. DESCRIPTIVE STATISTICSParameter: A characteristic of a population
• Usually we do not have the population available
• The Kiers family owns four vehicles. Following is the current mileage for each:
56,000; 23,000; 42,000; and 73,000.
= (56 000 + 23 000 + 42 000 + 73 000)4 = 48 500
2.1.2 Sample Mean
=1
X=1
Statistic: A characteristic of a sample
• Sample mean is an estimator of the population mean .
2.1. MEASUREOF CENTRAL TENDENCYORMEASURES OF LOCATION 7
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-8
Arithmetic Mean
The most common measure of central tendency
Mean = sum of values divided by the number of values Affected by extreme values (outliers)
(continued)
0 1 2 3 4 5 6 7 8 9 10
Mean = 3
0 1 2 3 4 5 6 7 8 9 10
Mean = 4
3515
554321
4520
5104321
Figure 2.6:
8 CHAPTER 2. DESCRIPTIVE STATISTICS• Notice that the unit for the sample mean is the same as the observations
themselves.
• Mean is sensitive to Outliers!
2.1.3 Combining Sample Means with different Sample Sizes:
Weighted Means
Suppose the sample mean in Economics 01 is 65 and that the mean male score is
62 and the female was 82. What is the proportion of female students in the class?
=1
X=1
=1
X=1
and
=1
( + )
+ X=1
.
We know that:
=
( + ) +
( + )
If we let be the proportion of females then:
= (1− ) +
65 = (1− ) × 62 + × 82
Solving for gives .15.
• Notice carefully to get the overall sum for the sample we must weight each of
the respective sample means (male and female) by their relative proportion
( ) of the sample.
• Weighted mean::
=X=1
whereX
= 1
• Question: If given unemployment rates for each of the 10 provinces how would
you calculate the unemployment rate for Canada?
2.1. MEASUREOF CENTRAL TENDENCYORMEASURES OF LOCATION 9
2.1.4 Grouped Data
For grouped data we can calculate the sample mean as
=1
X=1
where we have classes and is themidpoint (class mark or midpoint) for class
i with frequency .
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-43
Approximations for Grouped Data
Suppose a data set contains values m1, m2, . . ., mk, occurring with frequencies f1, f2, . . . fK
For a population of N observations the mean is
For a sample of n observations, the mean is
N
mfμ
K
1iii
n
mf
x
K
1iii
K
1iifNwhere
K
1iifnwhere
Figure 2.7:
10 CHAPTER 2. DESCRIPTIVE STATISTICS2.1.5 Median-Md (2)
The median Md of a set of data is a number such that half of the observations
are below the number and half of the observations are above that number when the
data are ordered in an array from lowest to highest.
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-9
Median
In an ordered list, the median is the “middle”number (50% above, 50% below)
Not affected by extreme values
0 1 2 3 4 5 6 7 8 9 10
Median = 3
0 1 2 3 4 5 6 7 8 9 10
Median = 3
Figure 2.8:
2.1. MEASUREOF CENTRAL TENDENCYORMEASURES OF LOCATION 11
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-10
Finding the Median
The location of the median:
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
Note that is not the value of the median, only the
position of the median in the ranked data
dataorderedtheinposition2
1npositionMedian
21n
Figure 2.9:
12 CHAPTER 2. DESCRIPTIVE STATISTICS
Figure 2.10: Number of Movies Showing at Theater
M o v ie ss h o w in g
F r e q u e n c y C u m u la t iv eF r e q u e n c y
1 -2 1 1
3 -4 2 3
5 -6 3 6
7 -8 1 7
9 -1 0 3 1 0
2.1.6 Median with Grouped Data
For grouped data we get the rather complex equation (recall ogive approximation
method)
= +2−
×
where
= lower boundary of class containing the median
= width of class containing median
= frequency of class containing median2− =
2minus the cumulative frequency of the class preceding the class
containing the median and being the total number of observations.
The median class is 5-6, since it contains the 5th value (n/2 =5)
2.1.7 Notes
• If n is odd then the median is the middle value of the ordered array.
• If n is even, then add the two central values and divide by two
• Suppose we have data = 10 (even) that has been ordered smallest to
largest:
2 5 6 7 10 15 21 21 23 25
Median= (10+15)/2=12.5.
• Median is not so sensitive to outliers in the data
• Unique median for each data set.
2.1. MEASUREOF CENTRAL TENDENCYORMEASURES OF LOCATION 13
2.1.8 Mode - Mo
The mode Mo of a set of numbers most frequently occurring value of the data set.
Ungrouped Data
2 4 2 8 7 mode is 2
Grouped Data: The midpoint (class mark) of the class containing the largest
number of class frequencies.
• Notice the mode need not be unique (i.e. bimodal)
• Not a good measure of central tendency
• Depends on frequency of observation rather than a balance over all observa-tions.
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-11
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values Used for either numerical or categorical data
There may may be no mode
There may be several modes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Mode = 9
0 1 2 3 4 5 6
No Mode
Figure 2.11:
14 CHAPTER 2. DESCRIPTIVE STATISTICS2.1.9 Relationship between Mean, Median and Mode
• Symmetric Distribution:
Median=Mean=(Mode)
• Right Skewed Distribution (Positively skewed):
Mean and Median are to the right of the Mode⇒ ModeMedianMean
• Left Skewed Distribution (Negatively Skewed):
Mean and Median are to the left of the Mode⇒ MeanMedianMode
2.2. MEASURES OF VARIABILITY OR DISPERSIONS 15
2.2 Measures of Variability or Dispersions
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-16
Same center, different variation
Measures of Variability
Variation
Variance Standard Deviation
Coefficient of Variation
Range Interquartile Range
Measures of variation give information on the spread or variability of the data values.
Figure 2.12:
16 CHAPTER 2. DESCRIPTIVE STATISTICS2.2.1 Range
• The range is the difference between the largest and the smallest values of the
data.
RANGE = −
• A sample of five accounting graduates revealed the following starting salaries:
$22,000, $28,000, $31,000, $23,000, $24,000. The range is $31,000 - $22,000 =
$9,000.
• Problem with this measure of dispersion is that it depends upon only two
numbers and is particularly sensitive to outliers in the data.
• Interquartile Range The interquartile range is 3 - 1, this is the range of
the middle 50 % of the data.
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-19
Interquartile Range
Can eliminate some outlier problems by using the interquartile range
Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
Interquartile range = 3rd quartile – 1st quartileIQR = Q3 – Q1
Figure 2.13:
2.2. MEASURES OF VARIABILITY OR DISPERSIONS 17
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-20
Interquartile Range
Median(Q2)
XmaximumX
minimum Q1 Q3
Example:
25% 25% 25% 25%
12 30 45 57 70
Interquartile range = 57 – 30 = 27
Figure 2.14:
18 CHAPTER 2. DESCRIPTIVE STATISTICS
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-21
Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment
25% 25% 25% 25%
The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile
Q1 Q2 Q3
Figure 2.15:
2.2. MEASURES OF VARIABILITY OR DISPERSIONS 19
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-22
Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 = 0.25(n+1)
Second quartile position: Q2 = 0.50(n+1)(the median position)
Third quartile position: Q3 = 0.75(n+1)
where n is the number of observed values
Figure 2.16:
20 CHAPTER 2. DESCRIPTIVE STATISTICS
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-23
(n = 9)
Q1 = is in the 0.25(9+1) = 2.5 position of the ranked data
so use the value half way between the 2nd and 3rd values,
so Q1 = 12.5
Quartiles
Sample Ranked Data: 11 12 13 16 16 17 18 21 22
Example: Find the first quartile
Figure 2.17:
2.2.2 Deviation from the Mean
The deviation from the sample mean is
− = 1
We might consider one measure of variation as the Average Deviation:
1
X=1
[ − ]
except this is equal to 0 (recall proof in first chapter)
2.3. POPULATION VARIANCE 21
2.2.3 Mean (Absolute) Deviations (MAD)
=1
X=1
| − |
• Example: The weights of a sample of crates containing books for the book-
store are (in lbs.) 103, 97, 101, 106, 103. = 12/5 = 2.4
2.3 Population Variance
2 =1
X=1
[ − ]2
=1
X=1
2 − 2
Notes
• 2 ≥ 0
• The square root of the variance, is called the population standard deviation
or root-mean squared deviation of X.
• is measured in the same units as X.
Sample Variance
2 = 1−1
P=1[ − ]2
Alternatively and try to derive this
2 = 1−1 [
P=1
2 − 2]
2 =
P
=12−(P
)2
−1
Grouped data
2 =1
− 1X=1
[ − ]2
22 CHAPTER 2. DESCRIPTIVE STATISTICS
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-44
Approximations for Grouped Data
Suppose a data set contains values m1, m2, . . ., mk, occurring with frequencies f1, f2, . . . fK
For a population of N observations the variance is
For a sample of n observations, the variance is
N
μ)(mfσ
K
1i
2ii
2
1n
)x(mfs
K
1i
2ii
2
Figure 2.18:
• We can have two data sets with the same means but quite different sample
variances.
• 2 is measured in 2
• is measured in the same units as the observations themselves.
2.3. POPULATION VARIANCE 23
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-28
Calculation Example:Sample Standard Deviation
Sample Data (xi) : 10 12 14 15 17 18 18 24
n = 8 Mean = x = 16
4.24267
126
18
16)(2416)(1416)(1216)(10
1n
)x(24)x(14)x(12)X(10s
2222
2222
A measure of the “average”scatter around the mean
Figure 2.19:
24 CHAPTER 2. DESCRIPTIVE STATISTICS
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-29
Measuring variation
Small standard deviation
Large standard deviation
Figure 2.20:
2.3. POPULATION VARIANCE 25
Empirical Rule for Bell Shaped (Normal) Distributions
For data that give rise to histograms with the familiar (symmetric) bell-shaped
appearance then:
• 68% of data lies within the interval ±
• 95% of data lies within the interval ± 2
• 99.7% of data lies within the interval ± 3
26 CHAPTER 2. DESCRIPTIVE STATISTICS2.3.1 Empirical Rule for Standard Deviations
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-34
If the data distribution is bell-shaped, then the interval:
contains about 68% of the values in the population or the sample
The Empirical Rule
1σμ
μ
68%
1σμ
Figure 2.21:
2.3. POPULATION VARIANCE 27
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-35
contains about 95% of the values in the population or the sample
contains about 99.7% of the values in the population or the sample
The Empirical Rule Normal
2σμ
3σμ
3σμ
99.7%95%
2σμ
Figure 2.22:
28 CHAPTER 2. DESCRIPTIVE STATISTICSFor most data, 90% or more of the values lie within plus or minus three standard
deviations from the mean
2.4. COEFFICIENT OF VARIATION ( ) 29
2.4 Coefficient of Variation ( )
=
× 100% for population
=
× 100% for sample
• Adjusts the standard deviation (population or sample) by dividing by themean again either by population or sample)
• Unit free and allows comparisons of variations of similar objects that controlfor size.
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-37
Comparing Coefficient of Variation
Stock A: Average price last year = $50
Standard deviation = $5
Stock B:
Average price last year = $100 Standard deviation = $5
Both stocks have the same standard deviation, but stock B is less variable relative to its price
10%100%$50
$5100%
x
sCVA
5%100%$100
$5100%
x
sCVB
Figure 2.23:
30 CHAPTER 2. DESCRIPTIVE STATISTICS2.5 Covariances: Population and Sample
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-45
The Sample Covariance The covariance measures the strength of the linear relationship
between two variables
The population covariance:
The sample covariance:
Only concerned with the strength of the relationship
No causal effect is implied
N
))(y(xy),(xCov
N
1iyixi
xy
1n
)y)(yx(xsy),(xCov
n
1iii
xy
Figure 2.24:
2.5. COVARIANCES: POPULATION AND SAMPLE 31
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-46
Covariance between two variables:
Cov(x,y) > 0 x and y tend to move in the same direction
Cov(x,y) < 0 x and y tend to move in opposite directions
Cov(x,y) = 0 x and y are independent
Interpreting Covariance
Figure 2.25:
32 CHAPTER 2. DESCRIPTIVE STATISTICS2.6 Correlation: Population and Sample
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-47
Coefficient of Correlation
Measures the relative strength of the linear relationship between two variables
Population correlation coefficient:
Sample correlation coefficient:
YX ss
y),(xCovr
YXσσ
y),(xCovρ
Figure 2.26:
2.7. EXPLORATORY DATA ANALYSIS 33
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-48
Features of Correlation Coefficient, r
Unit free
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear
relationship
The closer to 0, the weaker any positive linear
relationship
Figure 2.27:
2.7 Exploratory Data Analysis
2.7.1 Some Remarks
• Often we want to display graphically a measure of central tendency and
dispersion and the simplest way to do this is by a box plot.
2.8 BOX-and-Whisker PLOTS
• Plots 5 numbers along a single axis: (Five Number Summary)
• Minimum and Maximum Values
• 1, 2 and 3
• Example:
Based on a sample of 20 deliveries, Marco’s Pizza determined the following in-
formation: minimum value = 13 minutes, Q1 = 15 minutes, median = 18 minutes,
Q3 = 22 minutes, maximum value = 30 minutes
34 CHAPTER 2. DESCRIPTIVE STATISTICS
Figure 2.28: Vertical Box Plots
Usually Box-and-Whisker Plots are horizontal, but for some reason many sta-
tistical programs (STATA and Excel) plot them vertically.
2.9 Mathematical Note: Linear Transformations
• Let and be two constants and = 1 be a variable.
• An example would be with kilometers, and miles, then = 0 and
= 06.
A linear transformation is defined as:
= +
is said to be defined by a linear combination of .
• We note that is the intercept and is the slope of the equation.
2.10. STANDARDIZED VARIABLE () 35
• Notice that changes the origin and that changes the scale.
• How is the mean and variance of the variable Y related to the mean and the
variance of X?
• First consider the population case:
Mean of a Linear Transformation
= 1
P=1
= 1
P=1[ + ]
= + 1
P=1
= +
Variance of a Linear Transformation
2 = 1
P=1[ − ]
2
= 1
P=1[ + − { + }]2
= 1
P=1[{ − }]2
= 2 1
P=1[ − ]
2
= 2 2
This implies
= ||It also follow that for the sample counterparts that
= +
2 = 2 2
= ||
2.10 Standardized Variable ()
• Finally we may define a standardized variable (in terms of the sample) as
= −
Notes on Standarizing
36 CHAPTER 2. DESCRIPTIVE STATISTICS— This variable has a mean of zero (prove this using the results above
on linear transformations) and a sample standard deviation and variance
equal to 1 ( = − , = 1
).
— Variables defined in this way are sometimes referred to as , Z-scores.—
= 0 2 = = 1
— We can do the same type of transformation in terms of the population
magnitudes and .
— Notice that if an observation is below the mean then the − is
negative and above if it is positive.
Questions
There are many problems on means, variances and so on in LMM and you
should make sure you are comfortable with the topics by doings as many as
you can.
— NCT 2.28 a-c, 2.21, 2.29a,b,d,
— Linear Transformations Show the relationship between centigrade and
Fahrenheit.
— What temperature measure shows greater variation and hence provides
more detail on temperature.
— In general comment upon the relationship between the variance of the
transformed variable and the original as it relates to the slope parameter.