Sec 3.1 Measures of Central Tendency - UTRGV Faculty...


Page 1: Sec 3.1 Measures of Central Tendency - UTRGV Faculty Webfaculty.utrgv.edu/lakshmi.roychowdhury/SPRING, 2017... ·  · 2017-02-07Sec 3.1 Measures of Central Tendency ... The population

Sec 3.1 Measures of Central Tendency

The population arithmetic mean, μ (pronounced “mew”), is computed using all the individuals in a population. The population mean is a parameter.

If $x_1, x_2, \ldots, x_N$ are the N observations of a variable from a population, then the population mean, μ, is

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

If $x_1, x_2, \ldots, x_n$ are the n observations of a variable from a sample, then the sample mean, $\bar{x}$, is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

The median of a variable is the value that lies in the middle of the data when arranged in ascending order.

We use M to represent the median.

Steps in Finding the Median of a Data Set:

• If the number of observations is odd, then the median is the data value exactly in the middle of the data set. That is, the median is the observation that lies in the (n + 1)/2 position.

• If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the observations that lie in the n/2 position and the n/2 + 1 position.

Example: The following data represent the travel times (in minutes) to work for all seven employees of a start-up web development company.

23, 36, 23, 18, 5, 26, 43

Determine the median of this data.
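The two-case rule above can be checked with a short Python sketch. The function name `median` and the variable names are illustrative choices, not part of the notes; the data are the seven travel times from the example.

```python
def median(values):
    """Median via the (n + 1)/2 rule (odd n) or the n/2 and n/2 + 1 rule (even n)."""
    ordered = sorted(values)  # arrange the data in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # odd n: observation in position (n + 1)/2, counting from 1
        return ordered[(n + 1) // 2 - 1]
    # even n: mean of the observations in positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


travel_times = [23, 36, 23, 18, 5, 26, 43]
mean_time = sum(travel_times) / len(travel_times)
median_time = median(travel_times)

print(sorted(travel_times))  # [5, 18, 23, 23, 26, 36, 43]
print(median_time)           # 23 (the 4th of 7 ordered values)
print(round(mean_time, 2))   # 24.86
```

With n = 7 the median sits in position (7 + 1)/2 = 4 of the ordered data, which is 23 minutes.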


Example: Find the mean and median. Use the mean and median to identify the shape of the distribution. Verify your result by drawing a histogram of the data.

The following data represent the asking price of homes for sale in Lincoln, NE.

The mean asking price is $168,320 and the median asking price is $148,700. Therefore, we would conjecture that the distribution is skewed right.

The mode of a variable is the most frequent observation of the variable that occurs in the data set.

A set of data can have no mode, one mode, or more than one mode.

If no observation occurs more than once, we say the data have no mode.
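As a small illustration of the three cases (no mode, one mode, more than one mode), here is a sketch in Python. The helper name `modes` is hypothetical; the first data set is the travel-time example from Sec 3.1, and the other two lists are made-up values chosen only to show the remaining cases.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no observation occurs more than once: no mode
    return [value for value, count in counts.items() if count == highest]


travel_times = [23, 36, 23, 18, 5, 26, 43]
print(modes(travel_times))     # [23]   one mode
print(modes([1, 1, 2, 2, 3]))  # [1, 2] more than one mode (illustrative data)
print(modes([1, 2, 3]))        # []     no mode (illustrative data)
```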


Sec 3.2 Measures of Dispersion

The range, R, of a variable is the difference between the largest data value and the smallest data value.

Range = R = Largest Data Value – Smallest Data Value

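The population standard deviation is symbolically represented by σ (lowercase Greek sigma).

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

where $x_1, x_2, \ldots, x_N$ are the N observations in the population and μ is the population mean. The short-cut (computational) formula is the following:

$$\sigma = \sqrt{\frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{N}}{N}}$$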

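The sample standard deviation, s, of a variable is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

where $x_1, x_2, \ldots, x_n$ are the n observations in the sample and $\bar{x}$ is the sample mean. The short-cut (computational) formula is the following:

$$s = \sqrt{\frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}}$$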

Example: The following data represent the travel times (in minutes) to work for all seven employees of a start-up web development company.

23, 36, 23, 18, 5, 26, 43

Compute the population standard deviation of this data.
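One way to work this example is sketched below in Python. It computes σ twice, once from the definition and once from the standard computational (short-cut) form, so the two can be compared. The function names are illustrative, not from the notes.

```python
import math


def population_std(values):
    """Definition: square root of the mean squared deviation from mu."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def population_std_shortcut(values):
    """Computational form: sqrt((sum of x^2 - (sum of x)^2 / N) / N)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / n)


travel_times = [23, 36, 23, 18, 5, 26, 43]
sigma = population_std(travel_times)
sigma_shortcut = population_std_shortcut(travel_times)

print(round(sigma, 2))           # 11.36
print(round(sigma_shortcut, 2))  # 11.36
```

Here Σx = 174 and Σx² = 5228, so the variance is (5228 − 174²/7)/7 ≈ 128.98 and σ ≈ 11.36 minutes.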

The Empirical Rule:

If a distribution is roughly bell shaped, then

• Approximately 68% of the data will lie within 1 standard deviation of the mean. That is, approximately 68% of the data lie between μ – 1σ and μ + 1σ.


• Approximately 95% of the data will lie within 2 standard deviations of the mean. That is, approximately 95% of the data lie between μ – 2σ and μ + 2σ.

• Approximately 99.7% of the data will lie within 3 standard deviations of the mean. That is, approximately 99.7% of the data lie between μ – 3σ and μ + 3σ.

Note: We can also use the Empirical Rule based on sample data, with $\bar{x}$ used in place of μ and s used in place of σ.
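The three intervals of the Empirical Rule can be tabulated with a few lines of Python. This is only a sketch: the function name is hypothetical, and the mean of 50 and standard deviation of 10 in the usage lines are made-up illustrative values, not taken from any data set in these notes.

```python
def empirical_rule_intervals(mean, std):
    """Map k = 1, 2, 3 to (lower, upper, approximate percent of data) for bell-shaped data."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * std, mean + k * std, pct) for k, pct in coverage.items()}


# Illustrative values only (assumed, not from the notes): mean 50, standard deviation 10.
intervals = empirical_rule_intervals(50, 10)
for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% of the data lie between {low} and {high} (k = {k})")
```

The rule applies only when the distribution is roughly bell shaped; for other shapes the percentages can be far off.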

Example: The following data represent the serum HDL cholesterol of the 54 female patients of a family doctor.

(a) Compute the population mean and standard deviation.

(b) Draw a histogram to verify the data is bell-shaped.

(c) Determine the percentage of all patients that have serum HDL within 3 standard deviations of the mean according to the Empirical Rule.


(d) Determine the percentage of all patients that have serum HDL between 34 and 69.1 according to the Empirical Rule.

(e) Determine the actual percentage of patients that have serum HDL between 34 and 69.1.

Chebyshev’s Inequality:

For any data set or distribution (regardless of the shape of the distribution), at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the observations lie within k standard deviations of the mean, where k is any number greater than 1. That is, at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the data lie between μ – kσ and μ + kσ for k > 1.
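The bound in Chebyshev's Inequality is easy to evaluate for a few values of k. The following Python sketch uses a hypothetical function name and simply evaluates (1 − 1/k²)·100%.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Inequality requires k > 1")
    return (1 - 1 / k ** 2) * 100


bound_2 = chebyshev_lower_bound(2)
bound_3 = chebyshev_lower_bound(3)
print(bound_2)            # 75.0
print(round(bound_3, 1))  # 88.9
```

Compared with the Empirical Rule (about 95% within 2 standard deviations for bell-shaped data), the Chebyshev figure of at least 75% is weaker, but it holds for any shape of distribution.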

Note: We can also use Chebyshev’s Inequality based on sample data.

Sec 3.3 Measures of Central Tendency and Dispersion from Grouped Data

Approximate the Mean of a Variable from a Frequency Distribution

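Population Mean:

$$\mu = \frac{\sum x_i f_i}{\sum f_i} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{f_1 + f_2 + \cdots + f_n}$$

Sample Mean:

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{f_1 + f_2 + \cdots + f_n}$$

where $x_i$ is the midpoint or value of the ith class,

$f_i$ is the frequency of the ith class, and

n is the number of classes.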

Approximate the Standard Deviation of a Variable from a Frequency Distribution

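Population Standard Deviation:

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}}$$

Sample Standard Deviation:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}}$$

where $x_i$ is the midpoint or value of the ith class and

$f_i$ is the frequency of the ith class.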
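The grouped-data approximations can be sketched in Python as follows. The function names and the small frequency table (midpoints 5, 15, 25 with frequencies 2, 5, 3) are hypothetical and chosen only to give round numbers; they are not the survey data from the example.

```python
import math


def grouped_mean(midpoints, freqs):
    """Approximate mean: sum of x_i * f_i divided by the sum of f_i."""
    return sum(x * f for x, f in zip(midpoints, freqs)) / sum(freqs)


def grouped_std(midpoints, freqs, sample=False):
    """Approximate standard deviation; divides by (sum f_i) - 1 when sample=True."""
    total = sum(freqs)
    center = grouped_mean(midpoints, freqs)
    squared_dev = sum((x - center) ** 2 * f for x, f in zip(midpoints, freqs))
    return math.sqrt(squared_dev / (total - 1 if sample else total))


# Hypothetical frequency table (assumed for illustration only).
midpoints = [5, 15, 25]
freqs = [2, 5, 3]

approx_mean = grouped_mean(midpoints, freqs)
approx_sigma = grouped_std(midpoints, freqs)
approx_s = grouped_std(midpoints, freqs, sample=True)

print(approx_mean)         # 16.0
print(approx_sigma)        # 7.0
print(round(approx_s, 2))  # 7.38
```

These are approximations because every observation in a class is treated as if it were equal to the class midpoint.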

Example: The National Survey of Student Engagement is a survey that (among other things) asked first-year students at liberal arts colleges how much time they spend preparing for class each week. The results from the 2007 survey are summarized below. Approximate the standard deviation of the number of hours spent preparing for class each week.


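The weighted mean, $\bar{x}_w$, of a variable is found by multiplying each value of the variable by its corresponding weight, adding these products, and dividing this sum by the sum of the weights.

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$

where $w_i$ is the weight of the ith observation and

$x_i$ is the value of the ith observation.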

Example: Bob goes to the “Buy the Weigh” Nut store and creates his own bridge mix. He combines 1 pound of raisins, 2 pounds of chocolate covered peanuts, and 1.5 pounds of cashews. The raisins cost $1.25 per pound, the chocolate covered peanuts cost $3.25 per pound, and the cashews cost $5.40 per pound. What is the cost per pound of this mix?
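A possible way to work the bridge-mix example is sketched below in Python. The pounds of each ingredient are the weights and the prices per pound are the values; the function and variable names are illustrative.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


price_per_pound = [1.25, 3.25, 5.40]  # raisins, chocolate covered peanuts, cashews
pounds = [1, 2, 1.5]                  # weights

cost_per_pound = weighted_mean(price_per_pound, pounds)
print(round(cost_per_pound, 2))  # 3.52
```

The total cost is 1(1.25) + 2(3.25) + 1.5(5.40) = $15.85 for 4.5 pounds, or about $3.52 per pound.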

Sec 3.4 Measures of Position

The z-score represents the distance that a data value is from the mean in terms of the number of standard deviations.

Population z-score:

$$z = \frac{x - \mu}{\sigma}$$

Sample z-score:

$$z = \frac{x - \bar{x}}{s}$$

The z-score is unitless. It has mean 0 and standard deviation 1.

Example: The mean height of males 20 years or older is 69.1 inches with a standard deviation of 2.8 inches. The mean height of females 20 years or older is 63.7 inches with a standard deviation of 2.7 inches. The data are based on information obtained from the National Health and Examination Survey. Who is relatively taller?


Kevin Garnett’s height is 4.96 standard deviations above the mean. Candace Parker’s height is 4.56 standard deviations above the mean. Kevin Garnett is relatively taller.
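The comparison can be reproduced with the z-score formula. In the Python sketch below, the two heights (83 inches and 76 inches) do not appear in the notes as transcribed; they are assumptions back-calculated from the quoted z-scores, means, and standard deviations, and are labeled as such in the code.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std


# ASSUMPTION: heights inferred from the stated z-scores (69.1 + 4.96 * 2.8 and 63.7 + 4.56 * 2.7).
garnett_height = 83  # inches, assumed
parker_height = 76   # inches, assumed

garnett_z = z_score(garnett_height, 69.1, 2.8)  # compared with males 20 years or older
parker_z = z_score(parker_height, 63.7, 2.7)    # compared with females 20 years or older

print(round(garnett_z, 2))  # 4.96
print(round(parker_z, 2))   # 4.56
```

Because each height is measured against its own group's mean and standard deviation, the larger z-score identifies the person who is relatively taller.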

The kth percentile, denoted $P_k$, of a set of data is a value such that k percent of the observations are less than or equal to the value.

Example: Interpret the percentile

The Graduate Record Examination (GRE) is a test required for admission to many U.S. graduate schools. The University of Pittsburgh Graduate School of Public Health requires a GRE score no less than the 70th percentile for admission into their Human Genetics MPH or MS program. Interpret this admissions requirement.

In general, the 70th percentile is the score such that 70% of the individuals who took the exam scored worse, and 30% of the individuals scored better. In order to be admitted to this program, an applicant must score as high as or higher than 70% of the people who take the GRE. Put another way, the individual’s score must be in the top 30%.

Quartiles divide data sets into fourths, or four equal parts.

• The 1st quartile, denoted Q1, divides the bottom 25% of the data from the top 75%. Therefore, the 1st quartile is equivalent to the 25th percentile.

• The 2nd quartile divides the bottom 50% of the data from the top 50% of the data, so that the 2nd quartile is equivalent to the 50th percentile, which is equivalent to the median.

• The 3rd quartile divides the bottom 75% of the data from the top 25% of the data, so that the 3rd quartile is equivalent to the 75th percentile.


Example: A group of Brigham Young University—Idaho students (Matthew Herring, Nathan Spencer, Mark Walker, and Mark Steiner) collected data on the speed of vehicles traveling through a construction zone on a state highway, where the posted speed was 25 mph. The recorded speed of 14 randomly selected vehicles is given below:

20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40

Find and interpret the quartiles for speed in the construction zone.

Step 1: The data is already in ascending order.

Step 2: There are n = 14 observations, so the median, or second quartile, Q2, is the mean of the 7th and 8th observations. Therefore, M = 32.5.

Step 3: The median of the bottom half of the data is the first quartile, Q1.

20, 24, 27, 28, 29, 30, 32

The median of these seven observations is 28. Therefore, Q1 = 28. The median of the top half of the data is the third quartile, Q3.

33, 34, 36, 38, 39, 40, 40

The median of these seven observations is 38. Therefore, Q3 = 38.

• 25% of the speeds are less than or equal to the first quartile, 28 miles per hour, and 75% of the speeds are greater than 28 miles per hour.

• 50% of the speeds are less than or equal to the second quartile, 32.5 miles per hour, and 50% of the speeds are greater than 32.5 miles per hour.

• 75% of the speeds are less than or equal to the third quartile, 38 miles per hour, and 25% of the speeds are greater than 38 miles per hour.
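The median-of-halves steps used in this example can be sketched in Python as follows. The function names are illustrative. One assumption is labeled in the code: when n is odd, the middle observation is left out of both halves. The notes only show an even-n case, and other textbooks and software use different quartile conventions.

```python
def median_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median of the bottom and top halves."""
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    bottom = ordered[:half]
    # ASSUMPTION: for odd n the middle observation belongs to neither half.
    top = ordered[half + n % 2:]
    return median_sorted(bottom), median_sorted(ordered), median_sorted(top)


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
q1, m, q3 = quartiles(speeds)
print(q1, m, q3)  # 28 32.5 38
```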

The interquartile range, IQR, is the range of the middle 50% of the observations in a data set. That is, the IQR is the difference between the third and first quartiles and is found using the formula

IQR = Q3 – Q1

Example: A group of Brigham Young University—Idaho students (Matthew Herring, Nathan Spencer, Mark Walker, and Mark Steiner) collected data on the speed of vehicles traveling through a construction zone on a state highway, where the posted speed was 25 mph. The recorded speed of 14 randomly selected vehicles is given below:

20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40

Determine and interpret the interquartile range of the speed data.


The range of the middle 50% of the speed of cars traveling through the construction zone is 10 miles per hour.

Suppose a 15th car travels through the construction zone at 100 miles per hour. How does this value impact the mean, median, standard deviation, and interquartile range?
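One way to explore this question is to compute the four measures with and without the extra observation. The Python sketch below makes two assumptions that are also marked in the code: the 14 speeds are treated as a sample (so the sample standard deviation is used), and quartiles are found by the median-of-halves method with the middle observation excluded when n is odd. The helper names are illustrative.

```python
import statistics


def iqr_median_of_halves(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the bottom and top halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    bottom = ordered[:half]
    top = ordered[half + n % 2:]  # ASSUMPTION: middle value excluded when n is odd
    return statistics.median(top) - statistics.median(bottom)


def summarize(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # ASSUMPTION: sample standard deviation
        "iqr": iqr_median_of_halves(values),
    }


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
before = summarize(speeds)
after = summarize(speeds + [100])

for name in ("mean", "median", "std", "iqr"):
    print(f"{name}: {before[name]:.1f} -> {after[name]:.1f}")
```

Under these assumptions the mean moves from about 32.1 to 36.7 and the standard deviation from about 6.2 to 18.5, while the median only moves from 32.5 to 33 and the IQR from 10 to 11. The median and IQR are far less sensitive to the extreme value.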

Checking for Outliers by Using Quartiles

Step 1 Determine the first and third quartiles of the data.

Step 2 Compute the interquartile range.

Step 3 Determine the fences. Fences serve as cutoff points for determining outliers.

Lower Fence = Q1 – 1.5(IQR)

Upper Fence = Q3 + 1.5(IQR)


Step 4 If a data value is less than the lower fence or greater than the upper fence, it is

considered an outlier.
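The four steps can be sketched in Python as shown below. The function name is hypothetical, and the quartiles are passed in rather than computed, so the sketch does not commit to any particular quartile convention. The usage lines apply it to the 14 construction-zone speeds with Q1 = 28 and Q3 = 38 from Sec 3.4.

```python
def find_outliers(values, q1, q3):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR fences."""
    iqr = q3 - q1                  # Step 2
    lower_fence = q1 - 1.5 * iqr   # Step 3
    upper_fence = q3 + 1.5 * iqr
    # Step 4: values strictly beyond a fence are considered outliers
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
lower_fence, upper_fence, outliers = find_outliers(speeds, q1=28, q3=38)
print(lower_fence, upper_fence)  # 13.0 53.0
print(outliers)                  # []
```

All 14 speeds fall between the fences of 13 and 53 miles per hour, so this data set has no outliers.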

Sec 3.5 The Five-Number Summary and Boxplots

The five-number summary of a set of data consists of the smallest data value, Q1, the median, Q3, and the largest data value. We organize the five-number summary as follows:

MINIMUM   Q1   M   Q3   MAXIMUM

Drawing a Boxplot

Step 1: Determine the lower and upper fences.

Lower Fence = Q1 – 1.5(IQR)

Upper Fence = Q3 + 1.5(IQR)

where IQR = Q3 – Q1

Step 2: Draw a number line long enough to include all the necessary values. Insert vertical lines at Q1, M, and Q3. Enclose these vertical lines in a box.

Step 3: Label the lower and upper fences.

Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence. These values are called adjacent values and these lines are called whiskers.

Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).
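The numbers needed to draw a boxplot by hand (box edges, fences, whisker ends, and outliers) can be collected with a short Python sketch. The function name and dictionary keys are illustrative. The quartiles and median are supplied by the caller, here using the construction-zone speed values from Sec 3.4. One labeled assumption: a value lying exactly on a fence is not an outlier under Step 5, so the sketch keeps it as a candidate whisker end.

```python
def boxplot_values(values, q1, median, q3):
    """Collect the quantities used in Steps 1 to 5 of drawing a boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # Step 1
    upper_fence = q3 + 1.5 * iqr
    # ASSUMPTION: values exactly on a fence are treated as non-outliers.
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = [x for x in values if x < lower_fence or x > upper_fence]  # Step 5
    return {
        "box": (q1, median, q3),         # Step 2: vertical lines enclosed in a box
        "lower_fence": lower_fence,      # Step 3
        "upper_fence": upper_fence,
        "whisker_low": min(inside),      # Step 4: adjacent values
        "whisker_high": max(inside),
        "outliers": outliers,
    }


speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
summary = boxplot_values(speeds, q1=28, median=32.5, q3=38)
for name, value in summary.items():
    print(name, value)
```

For the speed data the whiskers run from 20 to 28 and from 38 to 40, and no points are marked with an asterisk.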

Example: Every six months, the United States Federal Reserve Board conducts a survey of credit card plans in the U.S. The following data are the interest rates charged by 10 credit card issuers randomly selected for the July 2005 survey. Construct a boxplot of the data.


Step 1: The interquartile range (IQR) is 14.4% - 12% = 2.4%. The lower and upper fences are:

Lower Fence = Q1 – 1.5(IQR) = 12 – 1.5(2.4) = 8.4%

Upper Fence = Q3 + 1.5(IQR) = 14.4 + 1.5(2.4) = 18.0%

Step 2: Draw the boxplot.

Use a boxplot and quartiles to describe the shape of a distribution.

The interest rate boxplot indicates that the distribution is skewed left.