Chapter 2 Descriptive Statistics - Queen's...

Chapter 2

Descriptive Statistics

Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 3-2

After completing this chapter, you should be able to:

Compute and interpret the mean, median, andmode for a set of data

Find the range, variance, standard deviation, andcoefficient of variation and know what these values mean

Apply the empirical rule to describe the variation of population values around the mean

Explain the weighted mean and when to use it

Chapter Goals

Figure 2.1:

1

2 CHAPTER 2. DESCRIPTIVE STATISTICS


Chapter Topics

Measures of central tendency, variation, and shape Mean, median, mode, geometric mean Quartiles

Range, interquartile range, variance and standard deviation, coefficient of variation

Symmetric and skewed distributions

Population summary measures Mean, variance, and standard deviation

Figure 2.2:

2.1. MEASUREOF CENTRAL TENDENCYORMEASURES OF LOCATION 3


Describing Data Numerically

Arithmetic Mean

Median

Mode

Describing Data Numerically

Variance

Standard Deviation

Coefficient of Variation

Range

Interquartile Range

Central Tendency Variation

Figure 2.3:

• We will consider the description of a data set, with and observations

for the sample and population respectively

• Often referred to as data reduction

2.1 Measure of Central Tendency or Measures of

Location



Measures of Central Tendency

Central Tendency

Mean Median Mode

n

xx

n

1ii

Overview

Midpoint of ranked values

Most frequently observed value

Arithmetic average

Figure 2.4:

2.1.1 Population Mean

=1

X=1



Arithmetic Mean

The arithmetic mean (mean) is the most common measure of central tendency

For a population of N values:

For a sample of size n:

Sample sizen

xxx

n

xx n21

n

1ii

Observed

values

N

xxx

N

xμ N21

N

1ii

Population size

Population values

Figure 2.5:

6 CHAPTER 2. DESCRIPTIVE STATISTICSParameter: A characteristic of a population

• Usually we do not have the population available

• The Kiers family owns four vehicles. Following is the current mileage for each:

56,000; 23,000; 42,000; and 73,000.

= (56 000 + 23 000 + 42 000 + 73 000)4 = 48 500

2.1.2 Sample Mean

=1

X=1

Statistic: A characteristic of a sample

• Sample mean is an estimator of the population mean .



Arithmetic Mean

The most common measure of central tendency

Mean = sum of values divided by the number of values Affected by extreme values (outliers)

(continued)

0 1 2 3 4 5 6 7 8 9 10

Mean = 3

0 1 2 3 4 5 6 7 8 9 10

Mean = 4

3515

554321

4520

5104321

Figure 2.6:

8 CHAPTER 2. DESCRIPTIVE STATISTICS• Notice that the unit for the sample mean is the same as the observations

themselves.

• Mean is sensitive to Outliers!

2.1.3 Combining Sample Means with different Sample Sizes:

Weighted Means

Suppose the sample mean in Economics 01 is 65 and that the mean male score is

62 and the female was 82. What is the proportion of female students in the class?

=1

X=1

=1

X=1

and

=1

( + )

+ X=1

.

We know that:

=

( + ) +

( + )

If we let be the proportion of females then:

= (1− ) +

65 = (1− ) × 62 + × 82

Solving for gives .15.

• Notice carefully to get the overall sum for the sample we must weight each of

the respective sample means (male and female) by their relative proportion

( ) of the sample.

• Weighted mean::

=X=1

whereX

= 1

• Question: If given unemployment rates for each of the 10 provinces how would

you calculate the unemployment rate for Canada?


2.1.4 Grouped Data

For grouped data we can calculate the sample mean as

=1

X=1

where we have classes and is themidpoint (class mark or midpoint) for class

i with frequency .


Approximations for Grouped Data

Suppose a data set contains values m1, m2, . . ., mk, occurring with frequencies f1, f2, . . . fK

For a population of N observations the mean is

For a sample of n observations, the mean is

N

mfμ

K

1iii

n

mf

x

K

1iii

K

1iifNwhere

K

1iifnwhere

Figure 2.7:

10 CHAPTER 2. DESCRIPTIVE STATISTICS2.1.5 Median-Md (2)

The median Md of a set of data is a number such that half of the observations

are below the number and half of the observations are above that number when the

data are ordered in an array from lowest to highest.


Median

In an ordered list, the median is the “middle”number (50% above, 50% below)

Not affected by extreme values

0 1 2 3 4 5 6 7 8 9 10

Median = 3

0 1 2 3 4 5 6 7 8 9 10

Median = 3

Figure 2.8:



Finding the Median

The location of the median:

If the number of values is odd, the median is the middle number

If the number of values is even, the median is the average of the two middle numbers

Note that is not the value of the median, only the

position of the median in the ranked data

dataorderedtheinposition2

1npositionMedian

21n

Figure 2.9:


Figure 2.10: Number of Movies Showing at Theater

M o v ie ss h o w in g

F r e q u e n c y C u m u la t iv eF r e q u e n c y

1 -2 1 1

3 -4 2 3

5 -6 3 6

7 -8 1 7

9 -1 0 3 1 0

2.1.6 Median with Grouped Data

For grouped data we get the rather complex equation (recall ogive approximation

method)

= +2−

×

where

= lower boundary of class containing the median

= width of class containing median

= frequency of class containing median2− =

2minus the cumulative frequency of the class preceding the class

containing the median and being the total number of observations.

The median class is 5-6, since it contains the 5th value (n/2 =5)

2.1.7 Notes

• If n is odd then the median is the middle value of the ordered array.

• If n is even, then add the two central values and divide by two

• Suppose we have data = 10 (even) that has been ordered smallest to

largest:

2 5 6 7 10 15 21 21 23 25

Median= (10+15)/2=12.5.

• Median is not so sensitive to outliers in the data

• Unique median for each data set.


2.1.8 Mode - Mo

The mode Mo of a set of numbers most frequently occurring value of the data set.

Ungrouped Data

2 4 2 8 7 mode is 2

Grouped Data: The midpoint (class mark) of the class containing the largest

number of class frequencies.

• Notice the mode need not be unique (i.e. bimodal)

• Not a good measure of central tendency

• Depends on frequency of observation rather than a balance over all observa-tions.


Mode

A measure of central tendency

Value that occurs most often

Not affected by extreme values Used for either numerical or categorical data

There may may be no mode

There may be several modes

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Mode = 9

0 1 2 3 4 5 6

No Mode

Figure 2.11:

14 CHAPTER 2. DESCRIPTIVE STATISTICS2.1.9 Relationship between Mean, Median and Mode

• Symmetric Distribution:

Median=Mean=(Mode)

• Right Skewed Distribution (Positively skewed):

Mean and Median are to the right of the Mode⇒ ModeMedianMean

• Left Skewed Distribution (Negatively Skewed):

Mean and Median are to the left of the Mode⇒ MeanMedianMode

2.2. MEASURES OF VARIABILITY OR DISPERSIONS 15

2.2 Measures of Variability or Dispersions


Same center, different variation

Measures of Variability

Variation

Variance Standard Deviation

Coefficient of Variation

Range Interquartile Range

Measures of variation give information on the spread or variability of the data values.

Figure 2.12:

16 CHAPTER 2. DESCRIPTIVE STATISTICS2.2.1 Range

• The range is the difference between the largest and the smallest values of the

data.

RANGE = −

• A sample of five accounting graduates revealed the following starting salaries:

$22,000, $28,000, $31,000, $23,000, $24,000. The range is $31,000 - $22,000 =

$9,000.

• Problem with this measure of dispersion is that it depends upon only two

numbers and is particularly sensitive to outliers in the data.

• Interquartile Range The interquartile range is 3 - 1, this is the range of

the middle 50 % of the data.


Interquartile Range

Can eliminate some outlier problems by using the interquartile range

Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data

Interquartile range = 3rd quartile – 1st quartileIQR = Q3 – Q1

Figure 2.13:



Interquartile Range

Median(Q2)

XmaximumX

minimum Q1 Q3

Example:

25% 25% 25% 25%

12 30 45 57 70

Interquartile range = 57 – 30 = 27

Figure 2.14:



Quartiles

Quartiles split the ranked data into 4 segments with an equal number of values per segment

25% 25% 25% 25%

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger

Q2 is the same as the median (50% are smaller, 50% are larger)

Only 25% of the observations are greater than the third quartile

Q1 Q2 Q3

Figure 2.15:



Quartile Formulas

Find a quartile by determining the value in the appropriate position in the ranked data, where

First quartile position: Q1 = 0.25(n+1)

Second quartile position: Q2 = 0.50(n+1)(the median position)

Third quartile position: Q3 = 0.75(n+1)

where n is the number of observed values

Figure 2.16:



(n = 9)

Q1 = is in the 0.25(9+1) = 2.5 position of the ranked data

so use the value half way between the 2nd and 3rd values,

so Q1 = 12.5

Quartiles

Sample Ranked Data: 11 12 13 16 16 17 18 21 22

Example: Find the first quartile

Figure 2.17:

2.2.2 Deviation from the Mean

The deviation from the sample mean is

− = 1

We might consider one measure of variation as the Average Deviation:

1

X=1

[ − ]

except this is equal to 0 (recall proof in first chapter)

2.3. POPULATION VARIANCE 21

2.2.3 Mean (Absolute) Deviations (MAD)

=1

X=1

| − |

• Example: The weights of a sample of crates containing books for the book-

store are (in lbs.) 103, 97, 101, 106, 103. = 12/5 = 2.4

2.3 Population Variance

2 =1

X=1

[ − ]2

=1

X=1

2 − 2

Notes

• 2 ≥ 0

• The square root of the variance, is called the population standard deviation

or root-mean squared deviation of X.

• is measured in the same units as X.

Sample Variance

2 = 1−1

P=1[ − ]2

Alternatively and try to derive this

2 = 1−1 [

P=1

2 − 2]

2 =

P

=12−(P

)2

−1

Grouped data

2 =1

− 1X=1

[ − ]2



Approximations for Grouped Data

Suppose a data set contains values m1, m2, . . ., mk, occurring with frequencies f1, f2, . . . fK

For a population of N observations the variance is

For a sample of n observations, the variance is

N

μ)(mfσ

K

1i

2ii

2

1n

)x(mfs

K

1i

2ii

2

Figure 2.18:

• We can have two data sets with the same means but quite different sample

variances.

• 2 is measured in 2

• is measured in the same units as the observations themselves.



Calculation Example:Sample Standard Deviation

Sample Data (xi) : 10 12 14 15 17 18 18 24

n = 8 Mean = x = 16

4.24267

126

18

16)(2416)(1416)(1216)(10

1n

)x(24)x(14)x(12)X(10s

2222

2222

A measure of the “average”scatter around the mean

Figure 2.19:



Measuring variation

Small standard deviation

Large standard deviation

Figure 2.20:


Empirical Rule for Bell Shaped (Normal) Distributions

For data that give rise to histograms with the familiar (symmetric) bell-shaped

appearance then:

• 68% of data lies within the interval ±

• 95% of data lies within the interval ± 2

• 99.7% of data lies within the interval ± 3

26 CHAPTER 2. DESCRIPTIVE STATISTICS2.3.1 Empirical Rule for Standard Deviations


If the data distribution is bell-shaped, then the interval:

contains about 68% of the values in the population or the sample

The Empirical Rule

1σμ

μ

68%

1σμ

Figure 2.21:



contains about 95% of the values in the population or the sample

contains about 99.7% of the values in the population or the sample

The Empirical Rule Normal

2σμ

3σμ

3σμ

99.7%95%

2σμ

Figure 2.22:

28 CHAPTER 2. DESCRIPTIVE STATISTICSFor most data, 90% or more of the values lie within plus or minus three standard

deviations from the mean

2.4. COEFFICIENT OF VARIATION ( ) 29

2.4 Coefficient of Variation ( )

=

× 100% for population

=

× 100% for sample

• Adjusts the standard deviation (population or sample) by dividing by themean again either by population or sample)

• Unit free and allows comparisons of variations of similar objects that controlfor size.


Comparing Coefficient of Variation

Stock A: Average price last year = $50

Standard deviation = $5

Stock B:

Average price last year = $100 Standard deviation = $5

Both stocks have the same standard deviation, but stock B is less variable relative to its price

10%100%$50

$5100%

x

sCVA

5%100%$100

$5100%

x

sCVB

Figure 2.23:

30 CHAPTER 2. DESCRIPTIVE STATISTICS2.5 Covariances: Population and Sample


The Sample Covariance The covariance measures the strength of the linear relationship

between two variables

The population covariance:

The sample covariance:

Only concerned with the strength of the relationship

No causal effect is implied

N

))(y(xy),(xCov

N

1iyixi

xy

1n

)y)(yx(xsy),(xCov

n

1iii

xy

Figure 2.24:

2.5. COVARIANCES: POPULATION AND SAMPLE 31


Covariance between two variables:

Cov(x,y) > 0 x and y tend to move in the same direction

Cov(x,y) < 0 x and y tend to move in opposite directions

Cov(x,y) = 0 x and y are independent

Interpreting Covariance

Figure 2.25:

32 CHAPTER 2. DESCRIPTIVE STATISTICS2.6 Correlation: Population and Sample


Coefficient of Correlation

Measures the relative strength of the linear relationship between two variables

Population correlation coefficient:

Sample correlation coefficient:

YX ss

y),(xCovr

YXσσ

y),(xCovρ

Figure 2.26:

2.7. EXPLORATORY DATA ANALYSIS 33


Features of Correlation Coefficient, r

Unit free

Ranges between –1 and 1

The closer to –1, the stronger the negative linear relationship

The closer to 1, the stronger the positive linear

relationship

The closer to 0, the weaker any positive linear

relationship

Figure 2.27:

2.7 Exploratory Data Analysis

2.7.1 Some Remarks

• Often we want to display graphically a measure of central tendency and

dispersion and the simplest way to do this is by a box plot.

2.8 BOX-and-Whisker PLOTS

• Plots 5 numbers along a single axis: (Five Number Summary)

• Minimum and Maximum Values

• 1, 2 and 3

• Example:

Based on a sample of 20 deliveries, Marco’s Pizza determined the following in-

formation: minimum value = 13 minutes, Q1 = 15 minutes, median = 18 minutes,

Q3 = 22 minutes, maximum value = 30 minutes


Figure 2.28: Vertical Box Plots

Usually Box-and-Whisker Plots are horizontal, but for some reason many sta-

tistical programs (STATA and Excel) plot them vertically.

2.9 Mathematical Note: Linear Transformations

• Let and be two constants and = 1 be a variable.

• An example would be with kilometers, and miles, then = 0 and

= 06.

A linear transformation is defined as:

= +

is said to be defined by a linear combination of .

• We note that is the intercept and is the slope of the equation.

2.10. STANDARDIZED VARIABLE () 35

• Notice that changes the origin and that changes the scale.

• How is the mean and variance of the variable Y related to the mean and the

variance of X?

• First consider the population case:

Mean of a Linear Transformation

= 1

P=1

= 1

P=1[ + ]

= + 1

P=1

= +

Variance of a Linear Transformation

2 = 1

P=1[ − ]

2

= 1

P=1[ + − { + }]2

= 1

P=1[{ − }]2

= 2 1

P=1[ − ]

2

= 2 2

This implies

= ||It also follow that for the sample counterparts that

= +

2 = 2 2

= ||

2.10 Standardized Variable ()

• Finally we may define a standardized variable (in terms of the sample) as

= −

Notes on Standarizing

36 CHAPTER 2. DESCRIPTIVE STATISTICS— This variable has a mean of zero (prove this using the results above

on linear transformations) and a sample standard deviation and variance

equal to 1 ( = − , = 1

).

— Variables defined in this way are sometimes referred to as , Z-scores.—

= 0 2 = = 1

— We can do the same type of transformation in terms of the population

magnitudes and .

— Notice that if an observation is below the mean then the − is

negative and above if it is positive.

Questions

There are many problems on means, variances and so on in LMM and you

should make sure you are comfortable with the topics by doings as many as

you can.

— NCT 2.28 a-c, 2.21, 2.29a,b,d,

— Linear Transformations Show the relationship between centigrade and

Fahrenheit.

— What temperature measure shows greater variation and hence provides

more detail on temperature.

— In general comment upon the relationship between the variance of the

transformed variable and the original as it relates to the slope parameter.

Chapter 2 Descriptive Statistics - Queen's...

Documents

Transcript of Chapter 2 Descriptive Statistics - Queen's...