Stats chapter 1

Chapter 1

Exploring Data

1.1 DISPLAYING DATA WITH GRAPHS

Categorical variables

Bar graphs
• Recall that the horizontal axis shows the category name and the vertical axis shows the count or percentage

Create a bar graph for “mobile phone carrier” for the students in this period in class

Start with a survey!

Categorical Variables

Pie Chart
• The area of each slice of pie reflects the relative frequency of the category the slice represents
– e.g. if “ATT” is used by 25% of the class, the area of the ATT slice must be 25% of the entire pie
• Remember: all categories must be represented in the pie

Typically, these are not fun to create

Quantitative Data

Stemplot (a.k.a. “Stem and Leaf Plot”)
A stemplot displays the shape of the distribution while keeping every individual value visible

Preview the example on pg 43!

Quantitative Data

Stemplot steps
1. Arrange the observations in numerical order
2. Separate each observation into a stem and a leaf
3. Write the stems in a vertical column
4. Write the leaf of each observation next to its stem. Leaves that are closest to the stem are lower in numerical value.

Quantitative Data

The following measurements are the number of points scored by THS football in each game of the 2009 season.

42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20

Quantitative Data

Stemplot steps
1. Arrange the observations in numerical order

14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53

Quantitative Data

Stemplot steps
2. Separate each observation into a stem and a leaf

1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3

Quantitative Data

Stemplot steps
3. Write the stems in a vertical column

1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3

1
2
3
4
5

Quantitative Data

Stemplot steps
4. Write the leaf of each observation next to its stem. Leaves that are closest to the stem are lower in numerical value.

1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3

1 | 4, 9
2 | 0, 0, 7, 8
3 | 0, 2
4 | 2, 4, 7
5 | 3

YAY!
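The four stemplot steps can also be sketched in code. This is a minimal Python sketch, not part of the original slides: the function name `stemplot_rows` is made up for illustration, and it assumes non-negative whole numbers where the last digit is the leaf.

```python
from collections import defaultdict

def stemplot_rows(values):
    """Return (stem, leaves) rows; assumes non-negative whole numbers."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):           # step 1: numerical order
        stem, leaf = divmod(value, 10)     # step 2: split off the last digit
        leaves_by_stem[stem].append(leaf)
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    # steps 3 and 4: one row per stem (empty stems kept), leaves in order
    return [(stem, leaves_by_stem.get(stem, [])) for stem in stems]

# THS football 2009 scores from the slides
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]
rows = stemplot_rows(scores)
for stem, leaves in rows:
    print(stem, "|", ", ".join(str(leaf) for leaf in leaves))
```

Keeping empty stems in the column is a deliberate choice here, so that gaps in the data stay visible in the plot.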

Quantitative Data

Histogram
• A histogram is similar to a bar graph, but is used for quantitative data only.
• Observations are separated into classes (number ranges)
– All classes must have equal width
• Like a bar graph, the height of each bar represents the count for each class

• Example 1.6 on pg 49

Quantitative Data

Histogram
Let’s use the same data from our previous example

14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53

Quantitative Data

Histogram
1. Separate the range into classes of equal width
Let’s try the following:
0 ≤ score ≤ 14
15 ≤ score ≤ 29
30 ≤ score ≤ 44
45 ≤ score ≤ 60

Quantitative Data

Histogram
2. Count the number of individuals in each class:

Class              Count
0 ≤ score ≤ 14     1
15 ≤ score ≤ 29    5
30 ≤ score ≤ 44    4
45 ≤ score ≤ 60    2
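The class counts can be double-checked with a few lines of code. This is a small Python sketch added for illustration. It treats each class as including both of its endpoints (an assumption that reproduces the counts on the slide) and prints a rough text version of the histogram.

```python
# THS football 2009 scores from the slides
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]

# Each class is (low, high), with both endpoints included (assumption)
classes = [(0, 14), (15, 29), (30, 44), (45, 60)]

counts = [sum(1 for s in scores if low <= s <= high) for low, high in classes]

# Rough text histogram: one '#' per observation in the class
for (low, high), count in zip(classes, counts):
    print(f"{low:2d} to {high:2d}: {'#' * count} ({count})")
```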

Quantitative Data

Histogram
3. Draw and label each axis:

[Figure: empty histogram axes. Vertical axis: COUNT, 0 to 6. Horizontal axis: Number of points scored, 0 to 60 in steps of 10.]

Quantitative Data

Histogram
4. Draw each bar to the correct height

[Figure: completed histogram. Vertical axis: COUNT, 0 to 6. Horizontal axis: Number of points scored, 0 to 60 in steps of 10.]

Assignment 1A

• 1.1-1.12 all
• Starts on pg 46

Examining Distributions

• Look for the pattern and any deviations from the general pattern

• In written work, you must describe C.U.S.S.
– Center
– Unusual features (outliers)
– Shape
– Spread

• Note: CUSS is just a mnemonic device. It is customary to discuss “unusual features” last


Center- We will discuss at greater length later. For now, you can use the median as a measure of center.
Spread- Also discussed later. For now, give the minimum and maximum values to describe spread.


Shape- We generally want to know two things:
1. How many peaks? Is it unimodal (one distinct peak) or is it uniform (no distinct peaks)?
2. Is the distribution symmetric (both tails are approximately equal) or skewed (one of the tails is longer)?
Left skewed- left tail is longer
Right skewed- right tail is longer


Outliers- Like many things in statistics, outliers can be a judgment call. Although we will learn a customary formula to determine outliers, the formula is arbitrary.

• In a histogram, outliers will be clearly separated from the rest of the observations

• Because class widths can be arbitrary, be sure to thoroughly examine the data before classifying an observation as an outlier.

• Do not ignore or delete outlier observations!

Relative Freq. and Cumulative Freq.

Let’s return to THS Football ‘09

Class              Count
0 ≤ score ≤ 14     1
15 ≤ score ≤ 29    5
30 ≤ score ≤ 44    4
45 ≤ score ≤ 60    2


We will add a column to show relative frequency

Yes, “relative frequency” is the same thing as “percentage”.
At this point, you could make a histogram using relative frequencies, if desired.

Score       Count    Rel. Freq. (%)
00 to 14    1        8
15 to 29    5        42
30 to 44    4        33
45 to 60    2        17


Now add a column to show cumulative frequency

Yes, keep adding the next rel. freq.
The last cell in the column should be 100, unless there is roundoff error (not a big deal).

Score       Count    Rel. Freq. (%)    Cum. Freq. (%)
00 to 14    1        8                 8
15 to 29    5        42                50
30 to 44    4        33                83
45 to 60    2        17                100
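Both new columns can be computed from the counts alone. This is a short Python sketch added for illustration, with variable names of my own choosing. It accumulates the raw counts before rounding, which is one way to keep roundoff error out of the last cell. Adding up the rounded percentages, as on the slide, gives the same numbers for this data.

```python
from itertools import accumulate

# Class counts from the THS football 2009 table
counts = [1, 5, 4, 2]
total = sum(counts)

# Relative frequency: each count as a percentage of the total
rel_freq = [round(100 * c / total) for c in counts]

# Cumulative frequency: running total of the counts, as a percentage
cum_freq = [round(100 * running / total) for running in accumulate(counts)]

print(rel_freq)
print(cum_freq)
```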


To create a “Cumulative Frequency Plot” or “Ogive”, start by creating axes similar to a histogram.
The vertical axis is percentage and should be labeled 0 to 100%.

[Figure: empty ogive axes. Vertical axis: Cumulative freq. (%), 0 to 100 in steps of 20. Horizontal axis: 0 to 60 in steps of 10.]



Plot points for each Cum. Freq. The left boundary of the first class should be plotted at zero. The last point plotted will be the right boundary of the last class at 100%.

[Figure: ogive axes with the cumulative frequency points plotted. Vertical axis: Cumulative freq. (%), 0 to 100. Horizontal axis: 0 to 60.]



CONNECT THE DOTS!

[Figure: completed ogive with the plotted points connected by line segments. Vertical axis: Cumulative freq. (%), 0 to 100. Horizontal axis: 0 to 60.]



Some notes about ogives.
• It’s pronounced “Oh-Jives”
• Ogives can be used to find approx. percentile rank
– The vertical axis is percentile!
• In particular, we are interested in:
– Median (50th percentile)
– First Quartile (25th percentile)
– Third Quartile (75th percentile)

The above vocab. will come up again. Memorize it!
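Reading a percentile off an ogive amounts to interpolating along the connected line segments. The Python sketch below is an illustration only, and the function name `value_at_percentile` is made up. It assumes the points are plotted at the class boundaries 0, 15, 30, 45, and 60, which is my reading of the plotting instructions rather than something the slides state outright. The results are graph-based approximations, so they need not match quartiles computed from the raw data.

```python
def value_at_percentile(points, percentile):
    """Linearly interpolate along the ogive's line segments."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= percentile <= y1 and y1 > y0:
            return x0 + (percentile - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("percentile must be between 0 and 100")

# (score, cumulative %) points; boundaries 0, 15, 30, 45, 60 are assumed
ogive = [(0, 0), (15, 8), (30, 50), (45, 83), (60, 100)]

q1_est = value_at_percentile(ogive, 25)
median_est = value_at_percentile(ogive, 50)
q3_est = value_at_percentile(ogive, 75)
print(round(q1_est, 1), round(median_est, 1), round(q3_est, 1))
```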

Assignment 1.1B

• P 64 #13-15, 21-25

1.2 DESCRIBING DATA WITH NUMB3RS

Measuring Center

• MEAN- calculated the same way you always calculate mean (average)

\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}

\bar{x} = \frac{1}{n} \sum x_i

• The symbol \bar{x} is read as “x-bar”
• The mean is not a resistant measure of center: it is sensitive to a few extreme observations.

Measuring Center

• Median- the “middle” number in a set of observations is known as the median

• If the data set has an even number of observations, then the median is the average of the two middle numbers

• Unlike the mean, the median is a resistant measure of center.
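A quick way to see what “resistant” means is to change one observation to an extreme value and compare the two measures. This Python sketch is an illustration added here. The altered score of 530 is a made-up value, not real data.

```python
from statistics import mean, median

# THS football 2009 scores from the slides
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]

# Hypothetical data-entry error: the top score 53 recorded as 530
altered = [530 if s == 53 else s for s in scores]

print(round(mean(scores), 2), median(scores))    # original mean and median
print(round(mean(altered), 2), median(altered))  # mean jumps, median does not
```

The mean is pulled far upward by the single extreme value, while the median is unchanged.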

Measuring Spread

The Quartiles
• The median of the subset of data less than the median is the First Quartile (Q1)
• The median of the subset of data greater than the median is the Third Quartile (Q3)

Notice that the median is not included in either of the above calculations.
Q1 is the 25th percentile
Q3 is the 75th percentile

Measuring Spread

Recall the data from THS Football 2009:
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53

We can number the ordered observations to help:
01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53

Measuring Spread

01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12

14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53

Notice that the median is the average of 28 and 30
Med. = 29

Measuring Spread

01, 02, 03, 04, 05, 06
14, 19, 20, 20, 27, 28
Q1 is the avg. of 20 and 20

07, 08, 09, 10, 11, 12
30, 32, 42, 44, 47, 53
Q3 is the avg. of 42 and 44

Q1 = 20
Med. = 29
Q3 = 43
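The quartile procedure used above (split the ordered data at the median, then take the median of each half) can be sketched in Python. The function name `quartiles` is my own. Software packages often use other quartile conventions, so their output may differ slightly from this hand method.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the median itself is left out of both halves
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(data), median(upper)

# THS football 2009 scores from the slides
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]
q1, med, q3 = quartiles(scores)
print(q1, med, q3)
```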

Measuring Spread

InterQuartile Range (IQR)
IQR is the preferred measurement of spread when the median is used to describe center.
IQR = Q3 - Q1

IQR = 43 – 20

IQR = 23

Measuring Spread

InterQuartile Range and Outliers
The previously mentioned formula for determining outlier observations depends on the IQR.
High outliers (outliers to the right): measurements greater than Q3 + 1.5 x IQR
Low outliers (outliers to the left): measurements less than Q1 - 1.5 x IQR

Measuring Spread

InterQuartile Range and Outliers
High outliers
greater than Q3 + 1.5 x IQR = 43 + 1.5 x 23
or any observation greater than 77.5

Low outliers
less than Q1 - 1.5 x IQR = 20 – 1.5 x 23
or any observation less than -14.5

Clearly, THS had no outlier football scores in 2009!
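The 1.5 x IQR check is easy to automate. This minimal Python sketch is added for illustration and takes Q1 = 20 and Q3 = 43 from the slides rather than recomputing them. The names `low_fence` and `high_fence` are my own labels for the two cutoffs.

```python
# THS football 2009 scores and quartiles from the slides
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]
q1, q3 = 20, 43

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # anything below this is a low outlier
high_fence = q3 + 1.5 * iqr   # anything above this is a high outlier

outliers = [s for s in scores if s < low_fence or s > high_fence]
print(low_fence, high_fence, outliers)
```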

Five Number Summary

A snapshot of a data distribution can be given with the 5 number summary:

Minimum, Q1, Median, Q3, Maximum

For our THS Football 2009, the five number summary is:

14, 20, 29, 43, 53

Five Number Summary

The 5 number summary is used to create a box plot (“box and whiskers” plot)

[Figure: box plot drawn over a number line from 0 to 60, with Min, Q1, Med, Q3, and Max labeled.]

Five Number Summary

BOX PLOT
• A number line must be included with a box plot
• Outliers appear as unconnected dots

[Figure: box plot drawn over a number line from 0 to 60.]

Assignment 1C

• P74 #27-30, 32, 34, 37

The Standard Deviation

When the mean is used as the measure of center, the preferred measures of spread are the related measurements “variance” and “standard deviation”.

variance = s^2

standard deviation = s

Yes, standard deviation is the square root of variance.


Formulation of variance

s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}

s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2

Yes, take the square root to find the std. dev.


For the THS 2009 data
Mean = 31.33
s^2 = [(14-31.33)^2 + (19-31.33)^2 + (20-31.33)^2 + (20-31.33)^2 + (27-31.33)^2 + (28-31.33)^2 + (30-31.33)^2 + (32-31.33)^2 + (42-31.33)^2 + (44-31.33)^2 + (47-31.33)^2 + (53-31.33)^2] / (12-1)

s^2 = 1730.67 / 11
s^2 = 157.33


• Notice that the variance s^2 = 157.33 is in squared units (points squared), so it is hard to relate directly to the data set!

• However, we can see that s = 12.54 (the square root of 157.33) is in the same units as the data and has some meaning.

• For many data sets, especially roughly symmetric ones, “the majority” of observations are within one standard deviation of the mean
Most data is btwn 31.33 - 12.54 and 31.33 + 12.54
-or- Most data is btwn 18.79 and 43.87 (8 of the 12 scores)
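The variance and standard deviation above can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the same n - 1 divisor as the formula on the slide. This sketch was added for illustration. It also counts how many scores fall within one standard deviation of the mean.

```python
from statistics import mean, stdev, variance

# THS football 2009 scores from the slides
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]

x_bar = mean(scores)
s2 = variance(scores)   # sample variance (divides by n - 1)
s = stdev(scores)       # sample standard deviation, the square root of s2

# Scores within one standard deviation of the mean
within = [x for x in scores if x_bar - s <= x <= x_bar + s]
print(round(x_bar, 2), round(s2, 2), round(s, 2), len(within))
```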

Which measurements do I choose?

• Use “mean and standard deviation” when the data is reasonably symmetric with no outliers.

• Use “median and IQR” or 5 num. sum. in cases where the “mean and std. dev.” is not appropriate.

• Remember: “5 num sum” is resistant to outliers, while the “mean and std dev” is not resistant

Linear Transformation of Data

• If every member of a data set is multiplied by a positive number b, then the measures of center and spread are also multiplied by b.

• If a constant a is added to every member of a data set, then a is added to the measures of center, but the measures of spread remain unchanged.

Linear Transformation of Data

Measurement          OLD DATA    TRANSFORMED DATA
Observation          x           a + b*x
Mean                 x̄           a + b*x̄
Std. dev.            s           b*s
Median               Med         a + b*Med
InterQuart. Range    IQR         b*IQR
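The table can be verified numerically. The Python sketch below is an illustration only. The constants a = 10 and b = 2 are arbitrary values chosen for the example, and the helper `iqr` uses the median-of-each-half method from earlier in the chapter.

```python
import math
from statistics import mean, median, stdev

def iqr(values):
    """IQR using the median-of-each-half method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(upper) - median(lower)

# THS football 2009 scores from the slides
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]
a, b = 10, 2   # hypothetical constants; b must be positive
transformed = [a + b * x for x in scores]

# Each pair is (measured on transformed data, predicted by the table)
checks = {
    "mean": (mean(transformed), a + b * mean(scores)),
    "median": (median(transformed), a + b * median(scores)),
    "std. dev.": (stdev(transformed), b * stdev(scores)),
    "IQR": (iqr(transformed), b * iqr(scores)),
}
for name, (measured, predicted) in checks.items():
    print(name, round(measured, 2), round(predicted, 2))
```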

Comparing Data Sets

• The AP Exam always asks students to compare data.

• Clearly identify the populations that are being compared

• Make sure to compare each of CUSS
• Make reference to the measurement you are comparing
– i.e. use “mean” and not “center”

• Give the values of the measurements you are comparing.

• Make use of comparison phrases “is greater than” “is less than”

Assignment 1D

• P89 #39-41, 45-47
