Math146 - Chapter 3 Handouts - CoffeeCup Software



The Greek Alphabet Source: www.mathwords.com



Some Miscellaneous Tips on Calculations

Examples: Round to the nearest thousandth

0.92431

0.75693

CAUTION! Do not truncate numbers!

Example: 1/6 = 0.166666…

A common mistake is to truncate this decimal, and write it as:

Round it off correctly (say, to three decimal places) as:

However, “in-between” zeros DO count as significant digits.

Examples: Round to three significant figures

0.20361

0.00059254

A FINAL CAUTION! Be very careful to not “overly” round off intermediate

calculations, if you are going to use those numbers in a subsequent calculation.

A better method is to store those values on your calculator (using the memory

registers), OR to just do the calculation using a single command which will

probably involve the use of a lot of parentheses ( ).
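The difference between rounding and truncating in the tips above can be checked with a short Python sketch. The helper names (`round_places`, `truncate_places`, `round_sig`) are made up for this illustration, and the `decimal` module is used so that the results match hand rounding (round half up) rather than binary floating-point behavior.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def round_places(value, places):
    """Round to a fixed number of decimal places (half up, as done by hand)."""
    quantum = Decimal(1).scaleb(-places)  # places=3 gives a 0.001 step
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


def truncate_places(value, places):
    """Chop off the extra digits without rounding (the common mistake)."""
    quantum = Decimal(1).scaleb(-places)
    return Decimal(value).quantize(quantum, rounding=ROUND_DOWN)


def round_sig(value, figures):
    """Round to a given number of significant figures."""
    d = Decimal(value)
    # adjusted() is the exponent of the first nonzero digit
    quantum = Decimal(1).scaleb(d.adjusted() - figures + 1)
    return d.quantize(quantum, rounding=ROUND_HALF_UP)


one_sixth = Decimal(1) / Decimal(6)

print(round_places("0.92431", 3))     # nearest thousandth
print(round_places("0.75693", 3))
print(truncate_places(one_sixth, 3))  # truncated value of 1/6
print(round_places(one_sixth, 3))     # correctly rounded value of 1/6
print(round_sig("0.20361", 3))        # three significant figures
print(round_sig("0.00059254", 3))
```

Leading zeros are skipped when counting significant figures here, which is why 0.00059254 keeps three digits starting at the 5.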



Section 3.1 Measures of Central Tendency

Descriptive measures: used to describe data sets.

Measure of center – a value at the of a data set

Three different measures of center:

1.

2.

3.

1. Mean

Probably the most commonly used measure of center.

Same as

Add the values and divide by the total number of values.

Write in symbols as: $\bar{x} = \frac{\sum x_i}{n}$

where, xi = the data values

n = number of values in the sample

Σ is the uppercase Greek letter ‘sigma’, the summation symbol; it means

“add all this stuff up”.

A parameter is a descriptive measure for a .

A statistic is a descriptive measure for a .



Sample statistics Population parameters

Number of Data Values

Symbol for mean

Formula for mean

Round-Off Rule: Round off your final answer to than

is present in the original set of data.

Example: Contents of a sample of cans of regular Coke have the following

weights in lbs:

0.8192 0.8150 0.8163 0.8211 0.8181 0.8247
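As a quick check of the "add the values and divide by the number of values" rule, here is a minimal Python sketch using the six Coke weights above. The variable names are illustrative, and printing five decimal places assumes the usual convention of carrying one more decimal place than the original data.

```python
# Weights (lb) of the six sampled cans of regular Coke, from the example above.
weights_lb = [0.8192, 0.8150, 0.8163, 0.8211, 0.8181, 0.8247]

n = len(weights_lb)          # number of values in the sample
mean = sum(weights_lb) / n   # add the values, divide by how many there are

# The data have four decimal places, so the mean is shown with five.
print(f"n = {n}, mean = {mean:.5f} lb")
```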

Advantage of using mean as the measure of center for a data set:

Takes every into account

Much statistical inference that will be performed is based on the mean

Disadvantage:

Can be dramatically affected by a few .

Example: What People Earn (see next page)

Is the mean a good measure of center for this data set?



What People Earn* Annual Salary Sorted Data

Admin. Clerk $38,000 1 $20,000

Real estate agent $103,100 2 $20,000

Professional golfer $5,500,000 3 $23,500

Dogwalker $20,000 4 $25,000

High school counselor $58,900 5 $26,000

Mechanical engineer $47,900 6 $38,000

Mechanical engineer $46,000 7 $39,500

Health-care director $68,000 8 $40,000

Bridal salon owner $25,000 9 $42,000

Private investigator $210,000 10 $46,000

Part-time acupuncturist $40,000 11 $46,000

Surgery resident $39,500 12 $47,900

Housekeeping aide $23,500 13 $54,600

Migrant family liaison $26,000 14 $58,900

Court clerk $54,600 15 $68,000

Sales manager $180,000 16 $103,100

Deputy sheriff $46,000 17 $180,000

Fishing guide $42,000 18 $210,000

Radio host $32,000,000 19 $5,500,000

Lay pastor $20,000 20 $32,000,000

=

Mean =

Median =

* Source: Parade (Tri-City Herald), March 2004

Note: This data is NOT randomly selected – just some data I happened to pick from one

particular page.
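To see how strongly the two largest salaries pull the mean, the table can be run through Python's standard `statistics` module. This is only a sketch for checking hand calculations; the list is copied from the Annual Salary column above.

```python
import statistics

# Annual salaries from the "What People Earn" table, in the original order.
salaries = [
    38_000, 103_100, 5_500_000, 20_000, 58_900, 47_900, 46_000, 68_000,
    25_000, 210_000, 40_000, 39_500, 23_500, 26_000, 54_600, 180_000,
    46_000, 42_000, 32_000_000, 20_000,
]

n = len(salaries)
mean = statistics.mean(salaries)
median = statistics.median(salaries)

# How many of the 20 people earn less than the mean?
below_mean = sum(s < mean for s in salaries)

print(f"n = {n}")
print(f"mean   = ${mean:,.0f}")
print(f"median = ${median:,.0f}")
print(f"{below_mean} of {n} salaries are below the mean")
```

Comparing the count of salaries below the mean with the count below the median is one way to judge which value better describes a "typical" salary in this list.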



2. Median

Can help overcome the disadvantage of the mean being dramatically

affected by .

Median is physically the when the original

data values are sorted in increasing order.

Symbol is for a median

Procedure to find median:

Sort data in order.

If odd number of values, median =

If even number of values, median =

Example: What People Earn (see previous page)

The median is .

Resistant measure – not sensitive to the influence of a few

.

Because of this, the median is frequently used for particular types of data,

instead of the mean.
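The procedure above (sort, then handle the odd and even cases separately) can be sketched as a small Python function. The function name and the toy data are invented for illustration; the even case uses the standard rule of averaging the two middle values.

```python
def median(values):
    """Median by the sort-and-pick-the-middle procedure."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)          # step 1: sort the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical toy data showing resistance to an extreme value.
typical = [20, 25, 30, 40, 45]
with_extreme = [20, 25, 30, 40, 1000]

print(median(typical), sum(typical) / len(typical))
print(median(with_extreme), sum(with_extreme) / len(with_extreme))
```

Replacing 45 with 1000 leaves the median unchanged while the mean moves a long way, which is what "resistant" means in practice.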



Example: Tri-City Housing Market

Source: https://www.tricitiesbusinessnews.com/2019/01/housing-market-slowing-down/



Example: US Household Income, 2018

Source: https://www.census.gov/library/stories/2019/09/us-median-household-income-up-in-2018-from-2017.html

Why do we use median instead of mean for data like housing prices and income?

Consider the population of:

All home selling prices in the Tri-Cities

All household incomes in the U.S.

The median is useful to eliminate the effect of the .




The is commonly quoted instead of the

for data that is strongly , such as:

Example: TV commercial for Abreva

“Studies show 4.1 days median healing time”.



3.

Is the value that occurs

Used mainly for , not numerical data

Can have more than one , if more than one value

occurs with the greatest frequency.

If no value is repeated, there is .

Example: Class survey, people who have a tattoo or not.

people said yes people said no

The mode is .

Example: What People Earn (see previously)

Mode:

Data set is:

Is the mode useful here?
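A frequency count is all that is needed to find a mode. The sketch below uses `collections.Counter`; the function name `modes` is made up, the tattoo responses are hypothetical (the class counts are not given here), and the salary list is copied from the What People Earn table.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency.

    An empty list is returned when no value repeats (no mode).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]


# Hypothetical qualitative responses to "Do you have a tattoo?"
tattoo = ["no", "yes", "no", "no", "yes"]

salaries = [
    38_000, 103_100, 5_500_000, 20_000, 58_900, 47_900, 46_000, 68_000,
    25_000, 210_000, 40_000, 39_500, 23_500, 26_000, 54_600, 180_000,
    46_000, 42_000, 32_000_000, 20_000,
]

print(modes(tattoo))
print(sorted(modes(salaries)))
print(modes([1.2, 3.4, 5.6]))   # no repeated value
```

For the salary data two values tie for the highest frequency, and each occurs only twice, which is worth keeping in mind when deciding whether the mode says anything useful about that data set.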



How to determine the most appropriate measure of center:

If there are no affecting the mean,

and if the distribution is fairly , then the

mean is the most appropriate choice for measure of center, because it

takes of the data values into consideration.

If there extreme values affecting the mean, and if the

distribution is fairly in either direction, then the

median might be a better choice, because it is

to the extreme values, and provides the best “typical” value of the data

set.

If the data consists of qualitative data, then the is the

only appropriate measure of center.

The mode is not commonly used with data.

For one reason, if you have continuous data, it is very possible that your

data set will not have a mode, because there are no repeated values. The

only time you would use it for a numerical set of data is if you specifically want

to know what the most frequent data value is.



1. From the class survey, here are the responses for how many of the United

States you have visited (combined with Math 146 Spring 2019 responses):

No. of states (sorted):

1 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4

4 5 5 5 5 5 5 6 6 6 6 6 7 7 7 8 8 8 8 8

9 9 9 10 10 10 12 12 13 15 16 21 22 23 26 30 34 35 35 40

Notice that there are a total of n = 60 data values.

1. Make a dotplot of the data (below), and use the plot to describe the distribution of the data.

Dotplot axis: Number of states

2. Find the following values: Mean = Median = Mode =

3. Which measure of center would you quote for this dataset, and why?



2. Answer and explain (briefly) your answers to the following. Note: one

way to explain is by coming up with a small example data set.

a. Is it possible that the mean might not equal any of the values in a data set?

b. Is it possible that the median might not equal any of the values in a data set?

c. Is it possible that the mean might be smaller than all of the values in a data set?

d. Is it possible that the median might be smaller than all of the values in a data

set?

e. Is it possible that the mean might be larger than only one value in a data set?

f. Is it possible that the median might be larger than only one value in a data set?

3. Think about all of the human beings alive at this moment.

a. Which value do you think is greater: the mean age or the median age of

all human beings alive at this moment? Explain your answer.

b. Which value do you think is greater: the mean age of all human beings

alive at this moment, or the mean age of all Americans alive at this

moment? Explain.

c. Estimate the median age of all human beings alive at this moment.

Estimate the median age of all Americans alive at this moment.



Section 3.2 Measures of Dispersion

Dispersion is the degree to which the data are

Example: Class Grades, 10 sample test grades from two sections

Section 1: 26, 43, 57, 64, 65, 79, 82, 88, 92, 104

Section 2: 50, 57, 66, 68, 70, 70, 72, 75, 82, 90



Different Measures of Dispersion/Variation:

1.

2.

3.

4.

1.

Difference between value and value

To calculate: range =

Benefit: Easy to compute

Drawback: as other measures of variation

because it depends only on highest and lowest values.

Example: Section 1: Range = Section 2: Range =

2.

Preferred measure of variation when the is used as

the measure of center.

Measures the variation of all sample values

Larger values of standard deviation indicate

Only equals 0 if all data values are the

Units are the same as the units of the original data

A drawback is that it is not , because its value

can be strongly affected by a few extreme data values.



Formula: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

where, s = standard deviation of a

xi =

x̄ = mean of values

n = number of data values in the

Example: Section 1 Grades

xi (data values)    xi - x̄    (xi - x̄)²

26

43

57 -13 169

64 -6 36

65 -5 25

79 9 81

82 12 144

88 18 324

92 22 484

104 34 1156

n = 10 data values

Sum of squared deviations =

=

Sample variance:

Sample standard deviation:

Same round-off rule: One decimal place than the original

data for the final answer.
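The deviation table for the Section 1 grades can be reproduced with a few lines of Python, which is handy for checking the hand calculation. This is a sketch that assumes the usual sample formulas (divide the sum of squared deviations by n - 1); the helper name `spread` is invented, and Section 2 is included for comparison.

```python
import math

section_1 = [26, 43, 57, 64, 65, 79, 82, 88, 92, 104]
section_2 = [50, 57, 66, 68, 70, 70, 72, 75, 82, 90]


def spread(values):
    """Range, sum of squared deviations, sample variance and sample std. dev."""
    n = len(values)
    mean = sum(values) / n
    squared = [(x - mean) ** 2 for x in values]   # (xi - mean)^2 column
    sum_sq = sum(squared)
    variance = sum_sq / (n - 1)                   # sample variance
    return {
        "range": max(values) - min(values),
        "mean": mean,
        "sum_sq": sum_sq,
        "variance": variance,
        "std_dev": math.sqrt(variance),
    }


# Print the three columns of the table for Section 1.
s1 = spread(section_1)
for x in section_1:
    d = x - s1["mean"]
    print(f"{x:>4} {d:>6.0f} {d ** 2:>6.0f}")

s2 = spread(section_2)
for name, result in (("Section 1", s1), ("Section 2", s2)):
    print(f"{name}: range = {result['range']}, "
          f"variance = {result['variance']:.1f}, s = {result['std_dev']:.1f}")
```

Both sections have the same mean, so the difference in the printed standard deviations reflects spread alone.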



In STATDISK: Data/Explore Data – Descriptive Statistics

Section 1: Section 2:

Could tell just by looking at the graphs that Section 1 was more spread out.

Now we have an actual measure of that variation.

Standard deviation for Section 2 is much smaller, because the values are not

spread as far apart, in general the values are closer to the .



                         Sample Statistics                  Population Parameters

Number of Data Values    n                                  N

Mean                     $\bar{x} = \frac{\sum x}{n}$       $\mu = \frac{\sum x}{N}$

Standard deviation

Variance

KEY!!! Standard deviation and variance are closely related – if you have one,

you can calculate the other!

Standard deviation = Variance =



Empirical Rule

Notice that this rule applies to data having an approximately

distribution. It can be used to determine the percentage of data that will lie within

a certain number of standard deviations of the mean.

1. About of all data values fall within 1 standard deviation of the mean.

2. About of all data values fall within 2 standard deviations of the mean.

3. About of all data values fall within 3 standard deviations of the mean.

Note: Can also be used assuming population parameters µ and σ.



Example: Using the Empirical Rule

Men’s pulse data from STATDISK

x̄ =

s =

x̄ – s =

x̄ + s =

From the Empirical Rule, would expect about 68% of the

data values to fall within the range of to

.

No. of values in this range =

x̄ – 2s =

x̄ + 2s =

From the Empirical Rule, would expect about 95% of the

data values to fall within the range of to

.

No. of values in this range =

From the Range Rule of Thumb, are there any unusual

Men’s Pulse Data

(Data Set 1, 12th ed.)

Pulse (bpm)

1 46

2 50

3 52

4 54

5 56

6 56

7 58

8 58

9 60

10 60

11 60

12 60

13 62

14 62

15 64

16 64

17 64

18 66

19 66

20 66

21 68

22 68

23 68

24 68

25 68

26 70

27 70

28 70

29 72

30 74

31 74

32 74

33 76

34 78

35 80

36 80

37 84

38 86

39 88

40 90
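The counting exercise for the men's pulse data can be checked with a short Python sketch. It uses `statistics.mean` and `statistics.stdev` (the sample standard deviation) rather than STATDISK; the helper `count_within` is a name invented here.

```python
import statistics

# Men's pulse rates (bpm), copied from the sorted table above.
pulse = [
    46, 50, 52, 54, 56, 56, 58, 58, 60, 60, 60, 60, 62, 62, 64, 64, 64,
    66, 66, 66, 68, 68, 68, 68, 68, 70, 70, 70, 72, 74, 74, 74, 76, 78,
    80, 80, 84, 86, 88, 90,
]

n = len(pulse)
mean = statistics.mean(pulse)
s = statistics.stdev(pulse)      # sample standard deviation (n - 1)


def count_within(k):
    """Count the data values within k standard deviations of the mean."""
    low, high = mean - k * s, mean + k * s
    return sum(low <= x <= high for x in pulse)


print(f"mean = {mean:.1f}, s = {s:.1f}")
for k, expected in ((1, 68), (2, 95)):
    count = count_within(k)
    print(f"within {k}s: {count} of {n} = {100 * count / n:.1f}% "
          f"(Empirical Rule suggests about {expected}%)")
```

The observed percentages will not match 68% and 95% exactly, since the rule is an approximation for roughly bell-shaped data and this sample has only 40 values.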



The following table reports the daily high temperatures (F) in February 2006

for three locations.

Feb Date    Lincoln, Neb    San Luis Obispo, CA    Sedona, AZ

1 53 68 62

2 59 69 64

3 40 77 62

4 36 68 66

5 36 76 61

6 44 71 61

7 46 79 68

8 34 85 68

9 41 87 63

10 39 67 66

11 27 75 57

12 30 81 62

13 61 83 66

14 68 64 63

15 41 57 55

16 26 57 51

17 11 53 48

18 14 54 48

19 28 52 47

20 47 58 49

21 51 58 50

22 53 67 55

23 48 70 61

24 65 67 62

25 36 65 64

26 53 64 67

27 71 63 66

28 73 62 57

1. Use the dot plots of the data below to

rank the three locations in order of

smallest to largest standard deviation:

Smallest std. dev:

Middle std. dev:

Largest std. dev:

2. On the next page, calculate (by hand)

the standard deviation for the

temperature data from San Luis Obispo.

Std. dev. =

3. Using the standard deviation for the

San Luis Obispo data, estimate the

standard deviation for the other two

locations:

Lincoln:

Sedona:



Calculate the standard deviation for the San Luis Obispo data set, using the

following table:

Day    High Temp. (F)    xi – x̄    (xi – x̄)²

1 68

2 69

3 77

4 68

5 76

6 71 3.25 10.5625

7 79 11.25 126.5625

8 85 17.25 297.5625

9 87 19.25 370.5625

10 67 -0.75 0.5625

11 75 7.25 52.5625

12 81 13.25 175.5625

13 83 15.25 232.5625

14 64 -3.75 14.0625

15 57 -10.75 115.5625

16 57 -10.75 115.5625

17 53 -14.75 217.5625

18 54 -13.75 189.0625

19 52 -15.75 248.0625

20 58 -9.75 95.0625

21 58 -9.75 95.0625

22 67 -0.75 0.5625

23 70 2.25 5.0625

24 67 -0.75 0.5625

25 65 -2.75 7.5625

26 64 -3.75 14.0625

27 63 -4.75 22.5625

28 62 -5.75 33.0625

Sum =

Variance = Standard Deviation =
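Once the San Luis Obispo value has been worked out by hand, a quick Python sketch can confirm it and check the visual ranking of the three locations. The data are copied from the February 2006 temperature table earlier in this handout; `statistics.stdev` computes the sample standard deviation, and the dictionary and variable names are illustrative only.

```python
import statistics

# Daily high temperatures (F), February 2006, one list per location.
temps = {
    "Lincoln, Neb": [
        53, 59, 40, 36, 36, 44, 46, 34, 41, 39, 27, 30, 61, 68,
        41, 26, 11, 14, 28, 47, 51, 53, 48, 65, 36, 53, 71, 73,
    ],
    "San Luis Obispo, CA": [
        68, 69, 77, 68, 76, 71, 79, 85, 87, 67, 75, 81, 83, 64,
        57, 57, 53, 54, 52, 58, 58, 67, 70, 67, 65, 64, 63, 62,
    ],
    "Sedona, AZ": [
        62, 64, 62, 66, 61, 61, 68, 68, 63, 66, 57, 62, 66, 63,
        55, 51, 48, 48, 47, 49, 50, 55, 61, 62, 64, 67, 66, 57,
    ],
}

std_devs = {city: statistics.stdev(values) for city, values in temps.items()}

# Rank the locations from smallest to largest standard deviation.
ranking = sorted(std_devs, key=std_devs.get)
for city in ranking:
    print(f"{city}: mean = {statistics.mean(temps[city]):.2f}, "
          f"s = {std_devs[city]:.1f}")
```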



Section 3.4 Measures of Position and Outliers

In this section, we will introduce a number of measures of position, which describe the

of a certain data value within the

entire set of data.

z-Scores

z-Scores are standardized values:

z score equals the number of standard deviations that a given data

value is above or below the

z score is if value is greater than mean, z score is

if value is less than mean

can use z-score to identify

To calculate:

For a sample: $z = \frac{x - \bar{x}}{s}$

For a population: $z = \frac{x - \mu}{\sigma}$

where x = the particular data value.

Round z scores off to decimal places.

The z-scores allow us to compare values from different data sets by providing a

standard basis of comparison.



Example: Comparing Standardized Test Scores

Suppose a college admissions office needs to compare scores of students who

take the Scholastic Aptitude Test (SAT) with those who take the American

College Test (ACT). Among the college’s applicants who take the SAT, scores

have a mean of 1500 and a standard deviation of 240. Among the college’s

applicants who take the ACT, scores have a mean of 21 and a standard

deviation of 6.

Mike scored 1740 on the SAT, and Packard scored 30 on the ACT.

Who did relatively better on their test?

Standardize the comparison by calculating the z-scores:

Mike: z = Packard: z =
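The comparison can be sketched in Python using the standard z-score definition, (value minus mean) divided by standard deviation. The function name `z_score` is made up for this example; the means and standard deviations are the ones stated for the college's applicants.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


mike = z_score(1740, mean=1500, std_dev=240)   # SAT
packard = z_score(30, mean=21, std_dev=6)      # ACT

print(f"Mike:    z = {mike:.2f}")
print(f"Packard: z = {packard:.2f}")
print("Relatively better score:", "Mike" if mike > packard else "Packard")
```

Because both scores are expressed in standard deviations above their own test's mean, the larger z-score marks the relatively better result even though the raw scores are on different scales.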



Identifying Outliers Using z-scores

Note that this method applies only to distributions that are fairly

, because it is based on the Empirical Rule.

Within a particular data set, the z-score is useful for giving us some idea of the

relative standing of a particular data value:

If a data value has a z-score fairly near 0, it is to the mean,

a very data value.

If a data value has a z-score of less than -2, or greater than +2, that

means it is very from the mean, and very far

away from most of the data values. That would be a less typical data

value.

Ordinary or “usual” values:

Unusual values:

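[Figure: z-score number line from -3 to 3; the region between -2 and 2 is labeled ordinary, and the regions below -2 and above 2 are labeled unusual]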
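The cutoff described above (z-scores below -2 or above +2) can be written as a one-line check. This is a minimal sketch with an invented function name and hypothetical z-scores; it only makes sense for roughly bell-shaped data, as noted at the top of this page.

```python
def classify(z, cutoff=2.0):
    """Label a z-score using the 'beyond 2 standard deviations' rule."""
    return "unusual" if abs(z) > cutoff else "ordinary"


# Hypothetical z-scores for illustration.
for z in (-2.5, -1.0, 0.3, 2.0, 2.4):
    print(f"z = {z:+.1f}: {classify(z)}")
```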



Percentiles and Quartiles

Percentiles

Percentiles are numbers that divide a data set into equal

parts, with about of the data in each part.

A data set has percentiles:

The “kth percentile” of an observation means that

of the observations are less than or equal to the observation.

Example: a data set with 500 values in it, sorted in ascending order

Percentiles would divide it into 100 groups, with 5 data values in each group.

1st 2nd 3rd 4th 5th | 6th 7th 8th 9th 10th | … | 496th 497th 498th 499th 500th

Quartiles – values that divide the data into four roughly equal groups.

There are three Quartiles, Q1, Q2 and Q3

Note that the data has to be sorted in ascending order.

Q1 separates the bottom of the values

Q2 is the – separates bottom from top

Q3 separates the top .



Note:

If the number of observations in the data set is odd,

include the median when determining Q1 and Q3.

Quartiles are a measure.



Example: For twelve data values (even) sorted in ascending order:

2 5 6 10 15 17 24 27 27 28 30 31

Q2 = median = .

Q1 = median of bottom half = .

Q3 = median of top half = .

Example: For eleven data values (odd) sorted in ascending order:

5 6 10 15 17 24 27 27 28 30 31

Q2 = median = .

Q1 = median of bottom half = .

Q3 = median of top half = .

Note that different textbooks or statistical software packages may have slightly

different methods on how to find the quartiles.

STATDISK will always give the same results for the quartiles as the method in

our textbook as long as there are an even number of data values.
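The quartile method described in these notes (median of each half, with the median included in both halves when the count is odd) can be sketched in Python as follows. The function names are invented, and as the note above says, other textbooks and software may use slightly different rules and give slightly different answers.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    if not values:
        raise ValueError("quartiles of an empty data set are undefined")
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    else:
        # Odd count: the median is included in both halves.
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)


twelve = [2, 5, 6, 10, 15, 17, 24, 27, 27, 28, 30, 31]
eleven = [5, 6, 10, 15, 17, 24, 27, 27, 28, 30, 31]

print("twelve values:", quartiles(twelve))
print("eleven values:", quartiles(eleven))
```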



Interquartile Range (IQR)

The interquartile range is another measure of .

IQR =

IQR represents the range of values over which of the data is spread.

Outliers

An outlier is a data value that is from the other data

values, an extreme observation. An outlier could be the result of:

An (measurement, sampling, or recording)

Just an unusually observation

Checking for outliers using Quartiles:

Calculate the fences, cutoff points for determining outliers:

Lower fence =

Upper fence =

A data value is considered an outlier if:

It is the lower fence, or

It is the upper fence.




Example: Natural Selection (source: Workshop Statistics, 4th edition)

A landmark study on the topic of natural selection was conducted by Hermon

Bumpus in 1898. Bumpus gathered extensive data on house sparrows that were

brought to the Anatomical Laboratory of Brown University in Providence, Rhode

Island, following a particularly severe winter storm. Some of the sparrows were

revived, but some sparrows perished. Bumpus analyzed his data to investigate

whether or not those that survived tended to have distinctive physical

characteristics related to their fitness.

The following sorted data are the total length measurements (in millimeters, from

the tip of the sparrow’s beak to the tip of its tail) for the 24 adult males that died

and the 35 adult males that survived. (note: I also added a column of numbers

next to each data list, just to identify the data values)

Sparrow Died Sparrow Lived

Minimum

Q1

Q2

Q3

Maximum

Interquartile Range (IQR)

Lower Fence

Upper Fence

Note: the first five rows that you filled out are called the:



No.  Sparrow Died Length (mm)    No.  Sparrow Lived Length (mm)

1 156 1 153

2 158 2 154

3 160 3 155

4 160 4 155

5 160 5 156

6 161 6 156

7 161 7 157

8 161 8 157

9 161 9 158

10 161 10 158

11 162 11 158

12 162 12 158

13 162 13 158

14 162 14 159

15 162 15 159

16 162 16 159

17 163 17 159

18 163 18 159

19 164 19 160

20 165 20 160

21 165 21 160

22 165 22 160

23 166 23 160

24 166 24 160

       25 160

       26 160

       27 160

       28 161

       29 161

       30 161

       31 161

       32 162

       33 163

       34 165

       35 166
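The summary table for the two groups of sparrows can be checked with a Python sketch. It reuses the median-of-halves quartile method from these notes and assumes the conventional fences, Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, with a value counted as an outlier when it falls strictly outside them. All function names are invented for this illustration.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def summarize(values):
    """Five-number summary, IQR, fences and outliers for one data set."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # Odd count: include the median in both halves (method in these notes).
    lower = data[:half] if n % 2 == 0 else data[:half + 1]
    upper = data[half:]
    q1, q2, q3 = median(lower), median(data), median(upper)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr      # assumed conventional 1.5 x IQR rule
    high_fence = q3 + 1.5 * iqr
    return {
        "min": data[0], "Q1": q1, "Q2": q2, "Q3": q3, "max": data[-1],
        "IQR": iqr, "lower_fence": low_fence, "upper_fence": high_fence,
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Total lengths (mm), copied from the table above.
died = [156, 158, 160, 160, 160, 161, 161, 161, 161, 161, 162, 162,
        162, 162, 162, 162, 163, 163, 164, 165, 165, 165, 166, 166]
lived = [153, 154, 155, 155, 156, 156, 157, 157, 158, 158, 158, 158,
         158, 159, 159, 159, 159, 159, 160, 160, 160, 160, 160, 160,
         160, 160, 160, 161, 161, 161, 161, 162, 163, 165, 166]

for name, group in (("Sparrow Died", died), ("Sparrow Lived", lived)):
    print(name)
    for key, value in summarize(group).items():
        print(f"  {key}: {value}")
```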



Boxplots

This is a graphical display of the data based on the .

Many books call what we are going to make a “modified boxplot”, because we

are going to indicate the outliers on our graph.



Sparrow died :

Sparrow lived :

Conclusions?

1. What do the boxplots reveal, as far as whether or not there appears to be a difference in lengths between the sparrows that survived and the sparrows that died?

2. What type of a study was this: observational, or designed experiment?

3. As such, can we conclude that being shorter caused the sparrows to be more likely to survive the storm?



PULSE RATES - MEN

Men's Pulse Rates –

SORTED (bpm)

1 46

2 50

3 52

4 54

5 56

6 56

7 58

8 58

9 60

10 60

11 60

12 60

13 62

14 62

15 64

16 64

17 64

18 66

19 66

20 66

21 68

22 68

23 68

24 68

25 68

26 70

27 70

28 70

29 72

30 74

31 74

32 74

33 76

34 78

35 80

36 80

37 84

38 86

39 88

40 90

1. Take your own pulse rate for one minute:

Pulse = beats per minute

2. Calculate your z-score.

z =

Assuming the distribution is bell-shaped, is your data value an

outlier?

3. Find the quartiles (do NOT add your data value):

Q1 =

Q2 =

Q3 =

4. Find the IQR (interquartile range):

IQR =

5. Identify any outliers (circle them):

Lower fence =

Upper fence =



PULSE RATES - WOMEN

Women's Pulse Rates –

SORTED (bpm)

1 56

2 60

3 62

4 62

5 64

6 64

7 66

8 68

9 68

10 72

11 72

12 72

13 72

14 72

15 72

16 72

17 74

18 74

19 76

20 76

21 78

22 78

23 78

24 78

25 78

26 78

27 78

28 80

29 82

30 82

31 82

32 88

33 90

34 90

35 90

36 96

37 98

38 98

39 100

40 104

1. Take your own pulse rate for one minute:

Pulse = beats per minute

2. Calculate your z-score.

z =

Assuming the distribution is bell-shaped, is your data value an

outlier?

3. Find the quartiles (do NOT add your data value):

Q1 =

Q2 =

Q3 =

4. Find the IQR (interquartile range):

IQR =

5. Identify any outliers (circle them):

Lower fence =

Upper fence =



Women’s Pulse Data: Men’s Pulse Data:

mean = mean =

standard deviation = standard deviation =

Who has the relatively higher pulse rate (compared to their respective populations):

A woman with a pulse rate of 100 bpm, or a man with a pulse rate of 88 bpm? Check by

calculating the z-score for both.

Women’s 5-number summary: Men’s 5-number summary:

Min = Min =

Q1 = Q1 =

Q2 = Q2 =

Q3 = Q3 =

Max = Max =

Boxplot of Women’s Pulse Rate Data (indicating potential outliers):

Boxplot of Men’s Pulse Rate Data (indicating potential outliers):

Pulse Rate (bpm)

What shape do the distributions appear to be, and what conclusions can you make based

on the comparison of the two boxplots?
