Objectives 1.2Describing distributions with numbers Measures of center: mean, median Mean versus...


Objectives

1.2 Describing distributions with numbers

Measures of center: mean, median

Mean versus median

Measures of spread: quartiles, standard deviation

Five-number summary and boxplot

Choosing among summary statistics

Changing the unit of measurement


Numerical descriptions of distributions

Describe the shape, center, and spread of a distribution…

Center: mean, median and mode.

Spread: range, IQR, standard deviation (SD).

We treat these as aids to understanding the distribution of the variable at hand…

The mean is often called the "average" and is in fact the arithmetic average ("add all the values and divide by the number of observations").


The mean or arithmetic average

To calculate the average, or mean, add all values, then divide by the number of individuals. It is the “center of mass.”

Measure of center: sample mean: Example 1

Heights (inches): 58.2, 59.5, 60.7, 60.9, 61.9

Sum of heights is 301.2; divided by 5 women, the mean is 301.2/5 = 60.24 inches.

Mathematical notation (sample mean):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For the 25 women listed below: $\bar{x} = \frac{1598.3}{25} = 63.9$ inches.

woman (i)   height (x)   woman (i)   height (x)

i = 1 x1= 58.2 i = 14 x14= 64.0

i = 2 x2= 59.5 i = 15 x15= 64.5

i = 3 x3= 60.7 i = 16 x16= 64.1

i = 4 x4= 60.9 i = 17 x17= 64.8

i = 5 x5= 61.9 i = 18 x18= 65.2

i = 6 x6= 61.9 i = 19 x19= 65.7

i = 7 x7= 62.2 i = 20 x20= 66.2

i = 8 x8= 62.2 i = 21 x21= 66.7

i = 9 x9= 62.4 i = 22 x22= 67.1

i = 10 x10= 62.9 i = 23 x23= 67.8

i = 11 x11= 63.9 i = 24 x24= 68.9

i = 12 x12= 63.1 i = 25 x25= 69.6

i = 13 x13= 63.9 n = 25, sum of heights = 1598.3

Learn right away how to get the mean using your calculator.

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

Measure of center: sample mean: Example 2
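The two sample-mean examples can be checked with a few lines of Python. This is only a sketch: the names `sample_mean`, `heights_5`, and `heights_25` are illustrative, and the 25 heights are copied from the table in the order listed there.

```python
# Example 1: heights (inches) of five women
heights_5 = [58.2, 59.5, 60.7, 60.9, 61.9]

# Example 2: heights (inches) of all 25 women, in the order listed in the table
heights_25 = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]


def sample_mean(values):
    """Add all the values, then divide by the number of observations."""
    return sum(values) / len(values)


mean_5 = sample_mean(heights_5)
mean_25 = sample_mean(heights_25)

print(f"Example 1: sum = {sum(heights_5):.1f}, n = {len(heights_5)}, mean = {mean_5:.2f}")
print(f"Example 2: sum = {sum(heights_25):.1f}, n = {len(heights_25)}, mean = {mean_25:.1f}")
```

The values are rounded when printed because floating-point sums can differ from the hand calculation in the last digits.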


Your numerical summary must be meaningful!

The distribution of women’s heights appears coherent and symmetrical. The mean is a good numerical summary.

Figure: Height of 25 women in a class ($\bar{x} = 63.9$).

The median (M) is often called the "middle" value: it is the value at the midpoint of the observations when they are ranked from smallest to largest.

Steps to get the median:

1. Arrange the data from smallest to largest.

2. If n is odd, the median is the single observation in the center (at the (n+1)/2 position in the ordering).

3. If n is even, the median is the average of the two middle observations (the (n+1)/2 position falls in between them).

E.g. 1: 5, 1, 7, 4, 3

E.g. 2: 5, 1, 7, 4, 3, 8
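The steps for finding the median can be sketched in Python as follows. The function name `median_by_steps` is illustrative, and the sketch assumes a non-empty list of numbers.

```python
def median_by_steps(values):
    """Median found by sorting, then picking the middle observation(s)."""
    ordered = sorted(values)      # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single observation at 1-based position (n + 1) / 2
        return ordered[mid]
    # n even: average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


eg1 = [5, 1, 7, 4, 3]
eg2 = [5, 1, 7, 4, 3, 8]
print(median_by_steps(eg1))  # 4
print(median_by_steps(eg2))  # 4.5
```

For E.g. 1 the sorted data are 1, 3, 4, 5, 7, so the middle (third) value is 4. For E.g. 2 the sorted data are 1, 3, 4, 5, 7, 8, so the median is the average of 4 and 5.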

Measure of center: the median

Note: for a median, 50% of the data are less than it and 50% of the data are bigger than it.

Example 1: with the data listed below, what are the mean and median? 2, 3, 5, 1.

Example 2: with the data listed below, what are the mean and median? 2, 3, 5, 1, 100.

Example 3: with the data listed below, what are the mean and median? -100, 2, 3, 5, 1, 100.

Question: What can we conclude from the examples above?

The mean is sensitive to outliers; the median is robust to outliers.
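One way to explore the three examples is to compute both statistics side by side. The sketch below uses Python's standard `statistics` module; the dictionary names are illustrative.

```python
from statistics import mean, median

examples = {
    "Example 1": [2, 3, 5, 1],
    "Example 2": [2, 3, 5, 1, 100],
    "Example 3": [-100, 2, 3, 5, 1, 100],
}

# Map each example to its (mean, median) pair
results = {name: (mean(data), median(data)) for name, data in examples.items()}

for name, (m, med) in results.items():
    print(f"{name}: mean = {m:.2f}, median = {med}")
```

Adding the single value 100 moves the mean from 2.75 to 22.2, while the median only moves from 2.5 to 3. With outliers on both sides (Example 3) the median returns to 2.5.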

Measure of center: the median

The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.

1. Sort observations by size. n = number of observations.

2.a. If n is odd, the median is observation (n+1)/2 down the list.

Sorted data (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

n = 25, (n+1)/2 = 26/2 = 13, Median = 3.4

2.b. If n is even, the median is the mean of the two middle observations.

Sorted data (n = 24): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6

n = 24, n/2 = 12, Median = (3.3 + 3.4)/2 = 3.35

Mean and median of a distribution with outliers

The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).

The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).

Figure (vertical axis: Percent of people dying): without the outliers, $\bar{x} = 3.4$; with the outliers, $\bar{x} = 4.2$.

Impact of skewed data

Mean and median of a symmetric ...

Disease X: $\bar{x} = 3.4$, $M = 3.4$. Mean and median are the same.

… and a right-skewed distribution

Multiple myeloma: $\bar{x} = 3.4$, $M = 2.5$. The mean is pulled toward the skew.


We can describe the shape, center and spread of a density curve in the same way we describe data… e.g.,

the median of a density curve is the “equal-areas” point - the point on the horizontal axis that divides the area under the density curve into two equal (.5 each) parts.

The mean of the density curve is the balance point - the point on the horizontal axis where the curve would balance if it were made of a solid material. (See figures 1.24b and 1.25 below)

Skewness: The mean is pulled toward the skew.

SYMMETRIC: Mode = Mean = Median

SKEWED LEFT (negatively): Mean, Median, and Mode separate; typically Mean < Median < Mode.

SKEWED RIGHT (positively): Mean, Median, and Mode separate; typically Mode < Median < Mean.

Spread: percentiles, quartiles (Q1 and Q3), IQR, 5-number summary (and boxplots), range, standard deviation

The pth percentile of a variable is a data value such that p% of the values of the variable fall at or below it.

The lower (Q1) and upper (Q3) quartiles are special percentiles dividing the data into quarters (fourths). Get them by finding the medians of the lower and upper halves of the data.

IQR = interquartile range = Q3 - Q1 = spread of the middle 50% of the data. IQR is used with the so-called 1.5*IQR criterion for outliers - know this!

Measure of spread: the quartiles


Eg1: Dataset: 3, 2, 1, 5, 6.

1) Find the Median, Q1, Q3 and IQR.

2) Find the 5-# summary.

3) Draw a Boxplot for Eg1.

Examples to find 5-# summary and Boxplot

Eg2: Dataset: 3, 2, 1, 5, 6, 8.

1) Find the Median, Q1, Q3 and IQR.

2) Find the 5-# summary.

3) Draw a Boxplot for Eg2.
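The numeric parts of Eg1 and Eg2 can be checked with a short Python sketch. It assumes the quartiles are the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when n is odd; the function name `five_number_summary` is illustrative, and the sketch expects at least two observations.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using medians of the two halves.

    Assumption: when n is odd, the overall median M is excluded
    from both halves before their medians are taken.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations below the median position
    upper = ordered[half + n % 2:]   # observations above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


for name, data in [("Eg1", [3, 2, 1, 5, 6]), ("Eg2", [3, 2, 1, 5, 6, 8])]:
    lo, q1, m, q3, hi = five_number_summary(data)
    print(f"{name}: min={lo}, Q1={q1}, M={m}, Q3={q3}, max={hi}, IQR={q3 - q1}")
```

Other software may use a different quartile convention and give slightly different Q1 and Q3 for small data sets.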

Definition, pg 35, Introduction to the Practice of Statistics, Sixth Edition

© 2009 W.H. Freeman and Company

Measure of spread: the quartiles

Measure of spread: the quartiles

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M).

Sorted data (n = 25):

Lower half: 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3

M: 3.4

Upper half: 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

Q1 = first quartile = 2.2

M = median = 3.4

Q3 = third quartile = 4.35
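As a quick check of the quartile values for these 25 observations, the halves can be sliced out directly. This is a sketch using Python's standard `statistics.median`; the variable names are illustrative.

```python
from statistics import median

# The 25 sorted observations from the quartile example
data = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,  # lower half
    3.4,                                                          # M
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,  # upper half
]

m = median(data)        # the 13th observation
q1 = median(data[:12])  # median of the 12 observations below M
q3 = median(data[13:])  # median of the 12 observations above M

# Share of the data at or below Q1 (close to, but not exactly, 25%)
share_at_or_below_q1 = sum(x <= q1 for x in data) / len(data)

print(f"Q1 = {q1:.2f}, M = {m}, Q3 = {q3:.2f}")
print(f"Share at or below Q1: {share_at_or_below_q1:.0%}")
```

With only 25 observations the share at or below Q1 comes out as 6/25 = 24%, which is why the "25% at or below" description is approximate for small samples.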

Definition, pg 37, Introduction to the Practice of Statistics, Sixth Edition

Definition, pg 38a, Introduction to the Practice of Statistics, Sixth Edition

Five-number summary and boxplot

Five-number summary: min, Q1, M, Q3, max

Disease X (years until death), sorted from largest to smallest: 6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.9, 1.6, 1.5, 1.2, 0.6

Largest = max = 6.1

Q3 = third quartile = 4.35

M = median = 3.4

Q1 = first quartile = 2.2

Smallest = min = 0.6

Figure: BOXPLOT of Disease X (vertical axis: Years until death, 0 to 7).

Boxplots for skewed data

Comparing box plots for a normal and a right-skewed distribution

Figure: boxplots for Disease X and Multiple Myeloma (vertical axis: Years until death, 0 to 15).

Boxplots remain true to the data and clearly depict symmetry or skew.

5-number summary: min, Q1, median, Q3, max.

When plotted, the 5-number summary is a boxplot. We can also do a modified boxplot to show outliers (mild and extreme).

Boxplots have less detail than histograms and are often used for comparing distributions… e.g., Fig. 1.17, p. 47 and below...

Suspected outliers: how to detect outliers

Outliers are troublesome data points, and it is important to be able to identify them.

One way to raise the flag for a suspected outlier is to compare the distance from the suspicious data point to the nearest quartile (Q1 or Q3). We then compare this distance to the interquartile range (distance between Q1 and Q3).

We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) above the third quartile or below the first quartile. This is called the “1.5 * IQR rule for outliers.”

Modified Boxplot

A modified boxplot helps detect outliers:

1. Calculate 1.5*IQR, then the bounds Q1 - 1.5*IQR and Q3 + 1.5*IQR.

2. Draw box and line (similar to before).

3. Draw whiskers to the minimum and maximum observations within (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

4. Observations outside this range should be plotted separately as dots.

Modified Boxplot

Sorted data (n = 25), from largest to smallest: 7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.9, 1.6, 1.5, 1.2, 0.6

Q3 = 4.35

Q1 = 2.2

Question 1: Are there any suspected outliers?

Question 2: If yes, then find the following values:

Calculate 1.5*IQR;

Lower bound = Q1 - 1.5*IQR;

Upper bound = Q3 + 1.5*IQR;

Find Min* = min within lower/upper bounds;

Find Max* = max within lower/upper bounds.

Question 3: Can we verify any outliers?

Question 4: Now draw the Modified Boxplot: draw Min* and Max*, Q1, Med, Q3. All observations outside this range should be plotted separately as dots.

Modified Boxplot

Q3 = 4.35

Q1 = 2.2

Interquartile range: Q3 - Q1 = 4.35 − 2.2 = 2.15

Distance to Q3: 7.9 − 4.35 = 3.55

Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 3.225 years, 1.5 * IQR. Thus, individual #25 is an outlier by our 1.5 * IQR rule.

Figure: modified boxplot of Disease X (vertical axis: Years until death, 0 to 8).
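The 1.5 * IQR check for this data set, along with the whisker ends needed for a modified boxplot, can be sketched in Python. The variable names (`lower_bound`, `min_star`, and so on) are illustrative, and the quartiles are taken as medians of the halves on either side of the median.

```python
from statistics import median

# Sorted data from the example (n = 25); 7.9 is the suspected outlier
data = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9,
]

q1 = median(data[:12])   # median of the 12 observations below M
q3 = median(data[13:])   # median of the 12 observations above M
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Observations outside the bounds are suspected outliers (plotted as dots)
outliers = [x for x in data if x < lower_bound or x > upper_bound]

# Whiskers extend to the most extreme observations inside the bounds
inside = [x for x in data if lower_bound <= x <= upper_bound]
min_star, max_star = min(inside), max(inside)

print(f"IQR = {iqr:.2f}, 1.5*IQR = {1.5 * iqr:.3f}")
print(f"Bounds: ({lower_bound:.3f}, {upper_bound:.3f})")
print(f"Outliers: {outliers}, Min* = {min_star}, Max* = {max_star}")
```

Only 7.9 lies beyond the upper bound of about 7.575, so the upper whisker stops at 6.1 and 7.9 is drawn as a separate dot.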

Measure of spread: the standard deviation

The standard deviation “s” is used to describe the variation around the mean. Like the mean, it is not resistant to skew or outliers.

1. First calculate the variance s^2.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation s.

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Figure: Mean ± 1 s.d. around $\bar{x}$.

Example 1: to calculate sample SD

Calculations… For data: 1, 2, 3, 4, 5. Q: Find the sample variance and sample SD.

$$s = \sqrt{\frac{1}{df}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Mean = 3

Sum of squared deviations from mean = 10

Degrees freedom (df) = (n − 1) = 4

s^2 = sample variance = 10/4 = 2.5

s = sample standard deviation = √2.5 = 1.58

Make sure to know how to get the standard deviation using your calculator.
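The same calculation can be followed step by step in Python. This is a sketch that mirrors the hand computation for the data 1, 2, 3, 4, 5; the variable names are illustrative.

```python
import math

data = [1, 2, 3, 4, 5]

n = len(data)
x_bar = sum(data) / n                              # mean
squared_devs = [(x - x_bar) ** 2 for x in data]    # (x_i - x_bar)^2 for each i
df = n - 1                                         # degrees of freedom
variance = sum(squared_devs) / df                  # s^2
sd = math.sqrt(variance)                           # s

print(f"mean = {x_bar}, sum of squared deviations = {sum(squared_devs)}")
print(f"df = {df}, s^2 = {variance}, s = {sd:.2f}")
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by n − 1 and should agree with these values.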

Example 2: Calculate by hand the sample SD for the following data set: 3, 4, 5, 8.

1. First calculate the variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.

2. Then take the square root to get the standard deviation $s = \sqrt{s^2}$.

Make sure to know how to get the standard deviation using your calculator.
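For checking a hand calculation, one possible worked solution for the data 3, 4, 5, 8 follows the two steps directly (rounded values are approximate):

```latex
\begin{align*}
\bar{x} &= \frac{3 + 4 + 5 + 8}{4} = \frac{20}{4} = 5 \\
\sum_{i=1}^{4} (x_i - \bar{x})^2 &= (-2)^2 + (-1)^2 + 0^2 + 3^2 = 4 + 1 + 0 + 9 = 14 \\
s^2 &= \frac{14}{4 - 1} = \frac{14}{3} \approx 4.67 \\
s &= \sqrt{\frac{14}{3}} \approx 2.16
\end{align*}
```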

How to use a calculator to find statistics

To find the sample mean, sample SD, and 5-# summary, we can use a calculator as follows:

1. Stat, then Edit, then choose 1: Edit…, then input your data into L1.

2. Stat, then Calc, then choose 1: 1-Var Stats, then Enter, Enter.

3. Read your outputs carefully.

Note: X-bar means sample mean; Sx means sample SD; n means sample size.

Q: Find the sample mean, sample SD, and 5-# summary for the following data:

Example 1: Data are: 3, 4, 5, 8.

Example 2: Data are: 1, 3, 5, 6, 7, 8.
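Without a calculator at hand, a rough Python stand-in for the 1-Var Stats output could look like the sketch below. The function name `one_var_stats` and the dictionary keys are illustrative, not the calculator's own interface. The quartiles are assumed to be medians of the lower and upper halves (overall median left out when n is odd), so a given calculator or software package may report slightly different Q1 and Q3.

```python
from statistics import mean, median, stdev


def one_var_stats(values):
    """Sample mean, sample SD, n, and five-number summary for one variable."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return {
        "x_bar": mean(ordered),
        "Sx": stdev(ordered),                   # sample SD (divides by n - 1)
        "n": n,
        "min": ordered[0],
        "Q1": median(ordered[:half]),           # median of the lower half
        "Med": median(ordered),
        "Q3": median(ordered[half + n % 2:]),   # median of the upper half
        "max": ordered[-1],
    }


example1 = one_var_stats([3, 4, 5, 8])
example2 = one_var_stats([1, 3, 5, 6, 7, 8])

for label, stats in [("Example 1", example1), ("Example 2", example2)]:
    print(label)
    for key, value in stats.items():
        print(f"  {key} = {value:.3f}" if key == "Sx" else f"  {key} = {value}")
```

Comparing these numbers with the calculator output is a useful way to confirm that the data were entered correctly.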

Definition, pg 43a, Introduction to the Practice of Statistics, Sixth Edition


ALWAYS PLOT DATA BEFORE DECIDING ON A NUMERICAL SUMMARY.

How to choose summary statistics? The 5-number summary is better than the mean and s.d. for skewed data; use the mean and s.d. for symmetric data.

How to perform data analysis: