Stats chapter 1
Uploaded by richard-ferreria
Chapter 1
Exploring Data
1.1 DISPLAYING DATA WITH GRAPHS
Categorical variables
Bar graphs
• Recall that the horizontal axis is the category name and the vertical axis is the count or percentage
Create a bar graph of “mobile phone carrier” for the students in this class period
Start with a survey!
Categorical Variables
Pie Chart
• The area of each slice of pie reflects the relative frequency of the category the slice represents
– i.e. if “ATT” is used by 25% of the class, the area of the ATT slice must be 25% of the entire pie
• Remember: all categories must be represented in the pie
Typically, these are not fun to create
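The slice rule can be checked with a few lines of Python. This is only a sketch: the carrier names and counts below are made-up survey results, not real class data.

```python
# Hypothetical survey results (made-up counts, for illustration only).
carrier_counts = {"ATT": 6, "Verizon": 9, "T-Mobile": 5, "Other": 4}

total = sum(carrier_counts.values())

# Relative frequency of each category, as a fraction of the whole class.
relative_freq = {name: count / total for name, count in carrier_counts.items()}

# A slice's central angle is its share of the full 360 degrees,
# so its area is the same share of the whole pie.
slice_angles = {name: 360 * freq for name, freq in relative_freq.items()}

for name in carrier_counts:
    print(f"{name}: {relative_freq[name]:.0%} of the class, "
          f"{slice_angles[name]:.0f} degree slice")
```

With these made-up counts, ATT is 6 of 24 students, so its slice covers a quarter of the pie.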
Quantitative Data
Stemplot (a.k.a. “Stem and Leaf Plot”)
A stemplot displays the shape of the distribution while keeping every individual data value visible
Preview the example on pg 43!
Quantitative Data
Stemplot steps
1. Arrange the observations in numerical order
2. Separate each observation into a stem and a leaf
3. Write stems in a vertical column
4. Write the leaf of each observation next to its stem. Leaves that are closest to the stem are lower in numerical value.
Quantitative Data
The following measurements are the number of points scored by THS football in each game of the 2009 season.
42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20
Quantitative Data
Stemplot steps
1. Arrange the observations in numerical order
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
Quantitative Data
Stemplot steps
2. Separate each observation into a stem and a leaf
1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3
Quantitative Data
Stemplot steps
3. Write stems in a vertical column
1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3
1
2
3
4
5
Quantitative Data
Stemplot steps
4. Write the leaf of each observation next to its stem. Leaves that are closest to the stem are lower in numerical value.
1/4, 1/9, 2/0, 2/0, 2/7, 2/8, 3/0, 3/2, 4/2, 4/4, 4/7, 5/3
1 | 4 9
2 | 0 0 7 8
3 | 0 2
4 | 2 4 7
5 | 3
Key: 1 | 4 means 14 points
YAY!
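The four stemplot steps can also be sketched in Python. The function name `stemplot` is just a label chosen for this sketch, and it assumes whole-number data where the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict

# THS football scores, 2009 season (from the slides).
scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]

def stemplot(values):
    """Return the rows of a stemplot for whole numbers (stem = tens, leaf = ones)."""
    leaves = defaultdict(list)
    for value in sorted(values):          # step 1: numerical order
        stem, leaf = divmod(value, 10)    # step 2: split into stem and leaf
        leaves[stem].append(leaf)
    # Steps 3 and 4: one row per stem, leaves in increasing order.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(f"{stem} | {row}")
    return rows

for row in stemplot(scores):
    print(row)
```

Every stem between the smallest and the largest gets a row, so a gap in the data would still show up as an empty row.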
Quantitative Data
Histogram
• A histogram is similar to a bar graph, but is used for quantitative data only.
• Observations are separated into classes (number ranges)
– All classes must have equal width
• Like a bar graph, the height of each bar represents the count for each class
• Example 1.6 on pg 49
Quantitative Data
Histogram
Let’s use the same data from our previous example
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
Quantitative Data
Histogram
1. Separate the range into classes of equal width
Let’s try the following:
00 ≤ score ≤ 14
15 ≤ score ≤ 29
30 ≤ score ≤ 44
45 ≤ score ≤ 60
Quantitative Data
Histogram
2. Count the number of individuals in each class:
Class              Count
00 ≤ score ≤ 14    1
15 ≤ score ≤ 29    5
30 ≤ score ≤ 44    4
45 ≤ score ≤ 60    2
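The class counts can be double-checked with a short Python sketch. It assumes each class includes both of its endpoints, which is what the counts in the table imply (the score of 14 lands in the first class).

```python
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

# Classes as (low, high) pairs; both ends are included (an assumption).
classes = [(0, 14), (15, 29), (30, 44), (45, 60)]

# Count how many scores fall inside each class.
counts = [sum(low <= s <= high for s in scores) for low, high in classes]

# Print a rough sideways histogram, one row per class.
for (low, high), count in zip(classes, counts):
    print(f"{low:02d} to {high:02d}: {'#' * count} ({count})")
```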
Quantitative Data
Histogram
3. Draw and label each axis:
[Figure: empty axes. Vertical axis: count, 0 to 6. Horizontal axis: number of points scored, 0 to 60.]
Quantitative Data
Histogram
4. Draw each bar to the correct height
[Figure: histogram of the THS scores. Vertical axis: count, 0 to 6. Horizontal axis: number of points scored, 0 to 60.]
Assignment 1A
• 1.1-1.12 all
• Starts on pg 46
Examining Distributions
• Look for the pattern and any deviations from the general pattern
• In written work, you must describe C.U.S.S.
– Center
– Unusual features (outliers)
– Shape
– Spread
• Note: CUSS is just a mnemonic device. It is customary to discuss “unusual features” last
Examining Distributions
Center- We will discuss this at greater length later. For now, you can use the median as a measure of center
Spread- Also discussed later. For now, give the minimum and maximum values to describe spread
Examining Distributions
Shape- We generally want to know two things
1. How many peaks? Is it unimodal (one distinct peak) or is it uniform (no distinct peaks)?
2. Is the distribution symmetric (both tails are approximately equal) or skewed (one of the tails is longer)?
Left skewed- left tail is longer
Right skewed- right tail is longer
Examining Distributions
Outliers- like many things in statistics, outliers can be a judgment call. Although we will learn a customary formula for determining outliers, the formula is arbitrary.
• In a histogram, outliers will be clearly separated from the rest of the observations
• Because class widths can be arbitrary, be sure to thoroughly examine the data before classifying an observation as an outlier.
• Do not ignore or delete outlier observations!
Relative Freq. and Cumulative Freq.
Let’s return to THS Football ‘09
Class              Count
00 ≤ score ≤ 14    1
15 ≤ score ≤ 29    5
30 ≤ score ≤ 44    4
45 ≤ score ≤ 60    2
Relative Freq. and Cumulative Freq.
We will add a column to show relative frequency
Yes, “relative frequency” is the same thing as “percentage” (the class count divided by the total count)
At this point, you could make a histogram using relative frequencies, if desired.
Score       Count    Rel. Freq. (%)
00 to 14    1        8
15 to 29    5        42
30 to 44    4        33
45 to 60    2        17
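The relative frequency column is just each count divided by the total. A minimal Python sketch, rounding to whole percentages as the table does:

```python
counts = [1, 5, 4, 2]          # class counts from the table
total = sum(counts)            # 12 games in the season

# Relative frequency of each class, as a whole-number percentage.
rel_freq = [round(100 * count / total) for count in counts]

print(rel_freq)
```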
Relative Freq. and Cumulative Freq.
Now add a column to show cumulative frequency
Yes, just keep adding the next rel. freq. to the running total
The last cell in the column should be 100, unless there is roundoff error (not a big deal)
Score       Count    Rel. Freq. (%)    Cum. Freq. (%)
00 to 14    1        8                 8
15 to 29    5        42                50
30 to 44    4        33                83
45 to 60    2        17                100
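The running total can be sketched with Python's `itertools.accumulate`, starting from the rounded relative frequencies in the table:

```python
from itertools import accumulate

rel_freq = [8, 42, 33, 17]     # relative frequencies (%) from the table

# Each cumulative frequency is the sum of all relative frequencies so far.
cum_freq = list(accumulate(rel_freq))

print(cum_freq)
```

Because it adds rounded percentages, this approach can land a point or so off 100 for other data sets, which is the roundoff error mentioned above.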
Relative Freq. and Cumulative Freq.
To create a “Cumulative Frequency Plot” or “Ogive”, start by creating axes similar to a histogram
The vertical axis is percentage and should be labeled 0 to 100%
[Figure: empty ogive axes. Vertical axis: cumulative freq. (%), 0 to 100. Horizontal axis: number of points scored, 0 to 60.]
Relative Freq. and Cumulative Freq.
Plot a point for each cum. freq. at the right boundary of its class. The left boundary of the first class should be plotted at 0%. The last point plotted will be the right boundary of the last class at 100%
[Figure: ogive axes with the cumulative frequency points plotted. Vertical axis: cumulative freq. (%), 0 to 100. Horizontal axis: number of points scored, 0 to 60.]
Relative Freq. and Cumulative Freq.
CONNECT THE DOTS!
[Figure: completed ogive with the plotted points joined by line segments. Vertical axis: cumulative freq. (%), 0 to 100. Horizontal axis: number of points scored, 0 to 60.]
Relative Freq. and Cumulative Freq.
Some notes about ogives.
• It’s pronounced “Oh-Jives”
• Ogives can be used to find approx. percentile rank
– The vertical axis is percentile!
• In particular, we are interested in:
– Median (50th percentile)
– First Quartile (25th percentile)
– Third Quartile (75th percentile)
The above vocab. will come up again. Memorize it!
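Reading a percentile off an ogive amounts to interpolating along the connected dots. The sketch below assumes the class boundaries sit at 0, 15, 30, 45 and 60 (the slides do not say exactly where the points are plotted), and `score_at_percentile` is a name made up for this example.

```python
# Ogive points as (score, cumulative %) pairs. The boundaries 0, 15, 30, 45, 60
# are an assumption made for this sketch.
ogive = [(0, 0), (15, 8), (30, 50), (45, 83), (60, 100)]

def score_at_percentile(points, percentile):
    """Read a score off the ogive by linear interpolation between plotted points."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= percentile <= y1:
            return x0 + (percentile - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("percentile must be between 0 and 100")

for name, p in [("Q1", 25), ("Median", 50), ("Q3", 75)]:
    print(f"{name}: about {score_at_percentile(ogive, p):.1f} points")
```

These are approximations based on the grouped data, so they will not match values computed from the raw scores exactly.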
Assignment 1.1B
• P 64 #13-15, 21-25
1.2 DESCRIBING DATA WITH NUMB3RS
Measuring Center
• MEAN- calculated the same way you always calculate mean (average)
• The symbol x̄ is read as “x-bar”
• The mean is not a resistant measure of center: it is sensitive to a few extreme observations.
\[
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum x_i
\]
Measuring Center
• Median- the “middle” number in a set of observations is known as the median
• If the data set has an even number of observations, then the median is the average of the two middle numbers
• Unlike the mean, the median is a resistant measure of center.
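A quick way to see what “resistant” means is to change one observation to something extreme and recompute. In this sketch the 153-point game is a made-up what-if, not real data.

```python
from statistics import mean, median

# THS football scores, 2009 season (from the slides), in order.
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

# Hypothetical what-if: replace the 53-point game with a 153-point game.
with_extreme = scores[:-1] + [153]

print(f"original:     mean = {mean(scores):.2f}, median = {median(scores)}")
print(f"with extreme: mean = {mean(with_extreme):.2f}, median = {median(with_extreme)}")
```

The mean jumps by more than 8 points while the median does not move at all.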
Measuring Spread
The Quartiles
• The median of the subset of data less than the median is the First Quartile (Q1)
• The median of the subset of data greater than the median is the Third Quartile (Q3)
Notice that the median is not included in either of the above calculations
Q1 is the 25th percentile
Q3 is the 75th percentile
Measuring Spread
Recall the data from THS Football 2009
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
We can number the positions of the ordered observations to help
01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
Measuring Spread
01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12
14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53
Notice that the median is the average of 28 and 30
Med. = 29
Measuring Spread
01, 02, 03, 04, 05, 06
14, 19, 20, 20, 27, 28
Q1 is the avg. of 20 and 20
07, 08, 09, 10, 11, 12
30, 32, 42, 44, 47, 53
Q3 is the avg. of 42 and 44
Q1 = 20
Med. = 29
Q3 = 43
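The same split-the-data method can be sketched in Python. The function name `quartiles` is made up for this example, and it follows the rule used in these notes (the median itself is left out of both halves); other textbooks and software may use slightly different rules. It needs at least two observations.

```python
from statistics import median

scores = [42, 27, 19, 14, 20, 47, 53, 28, 32, 30, 44, 20]

def quartiles(values):
    """Return (Q1, median, Q3), leaving the median out of both halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # observations below the median position
    upper = ordered[-half:]    # observations above the median position
    return median(lower), median(ordered), median(upper)

q1, med, q3 = quartiles(scores)
print(f"Q1 = {q1}, Med. = {med}, Q3 = {q3}")
```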
Measuring Spread
InterQuartile Range (IQR)
IQR is the preferred measurement of spread when the median is used to describe center
IQR = Q3 - Q1
IQR = 43 – 20
IQR = 23
Measuring Spread
InterQuartile Range and Outliers
The previously mentioned formula for determining outlier observations depends on IQR
High outliers (outliers to the right): measurements greater than Q3 + 1.5 x IQR
Low outliers (outliers to the left): measurements less than Q1 - 1.5 x IQR
Measuring Spread
InterQuartile Range and Outliers
High outliers
greater than Q3 + 1.5 x IQR = 43 + 1.5 x 23
or any observation greater than 77.5
Low outliers
less than Q1 - 1.5 x IQR = 20 – 1.5 x 23
or any observation less than -14.5
Clearly, THS had no outlier football scores in 2009!
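The 1.5 x IQR check is easy to run in Python. This sketch plugs in the quartiles found earlier in these notes; the names `low_fence` and `high_fence` are just labels for the two cutoffs.

```python
scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

q1, q3 = 20, 43                # quartiles found earlier in the notes
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr     # anything below this is a low outlier
high_fence = q3 + 1.5 * iqr    # anything above this is a high outlier

outliers = [s for s in scores if s < low_fence or s > high_fence]

print(f"fences: {low_fence} and {high_fence}")
print(f"outliers: {outliers}")
```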
Five Number Summary
A snapshot of a data distribution can be given with the 5 number summary:
Minimum, Q1, Median, Q3, Maximum
For our THS Football 2009, the five number summary is:
14, 20, 29, 43, 53
Five Number Summary
The 5 number summary is used to create a box plot (“box and whiskers” plot)
[Figure: box plot of the THS scores over a number line from 0 to 60, with Min, Q1, Med, Q3 and Max labeled.]
Five Number Summary
BOX PLOT
• a number line must be included with a box plot
• outliers appear as unconnected dots
[Figure: box plot drawn above a number line from 0 to 60.]
Assignment 1C
• P74 #27-30, 32, 34, 37
The Standard Deviation
When the mean is used as the measure of center, the preferred measures of spread are the related measurements “variance” and “standard deviation”
variance = s²
standard deviation = s
Yes, standard deviation is the square root of variance.
The Standard Deviation
Formulation of variance
\[
s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}
\]
\[
s^2 = \frac{1}{n - 1}\sum (x_i - \bar{x})^2
\]
Yes, take the square root to find the std. dev.
The Standard Deviation
For the THS 2009 data
Mean = 31.33
s² = [(14-31.33)² + (19-31.33)² + (20-31.33)² + (20-31.33)² + (27-31.33)² + (28-31.33)² + (30-31.33)² + (32-31.33)² + (42-31.33)² + (44-31.33)² + (47-31.33)² + (53-31.33)²] / (12-1)
s² = 1730.67 / 11
s² = 157.33
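The long sum above is a good candidate for a computer. A minimal Python sketch that follows the formula step by step (Python's `statistics.variance` and `statistics.stdev` should give the same answers):

```python
from math import sqrt

scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

n = len(scores)
x_bar = sum(scores) / n                       # the mean

# Squared distance of each observation from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in scores]

variance = sum(squared_deviations) / (n - 1)  # divide by n - 1, not n
std_dev = sqrt(variance)

print(f"mean = {x_bar:.2f}, variance = {variance:.2f}, std. dev. = {std_dev:.2f}")
```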
The Standard Deviation
• Notice that the number s² = 157.33 is hard to relate to the data set, because variance is measured in squared units (points²).
• However, s = 12.54 is in the same units as the data (points), so it has some meaning for our scores.
• For roughly symmetric data sets with no outliers, the majority of observations are typically within one standard deviation of the mean (a rule of thumb, not a guarantee)
Most data is btwn 31.33 - 12.54 and 31.33 + 12.54
-or- Most data is btwn 18.79 and 43.87
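For the THS scores, the “most data” claim can be checked by counting. A short Python sketch:

```python
from statistics import mean, stdev

scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

x_bar, s = mean(scores), stdev(scores)

# Observations no more than one standard deviation away from the mean.
within = [x for x in scores if x_bar - s <= x <= x_bar + s]

print(f"{len(within)} of {len(scores)} scores are within one std. dev. of the mean")
```

Here 8 of the 12 games (about 67%) fall in that range.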
Which measurements do I choose?
• Use “mean and standard deviation” when the data is reasonably symmetric with no outliers.
• Use “median and IQR” or 5 num. sum. in cases where the “mean and std. dev.” is not appropriate.
• Remember: “median and IQR” are resistant to outliers, while “mean and std dev” are not resistant (the min and max in the “5 num sum” are not resistant either)
Linear Transformation of Data
• If every member of a data set is multiplied by a positive number b, then the measures of center and spread are also multiplied by b.
• If a constant a is added to every member of a data set, then a is added to the measures of center, but the measures of spread remain unchanged.
Linear Transformation of Data
Measurement          OLD DATA    TRANSFORMED DATA
Observation          x           a + b*x
Mean                 x̄           a + b*x̄
Std. dev.            s           b*s
Median               Med         a + b*Med
InterQuart. Range    IQR         b*IQR
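The table can be verified numerically. In this Python sketch the constants a = 10 and b = 2 are arbitrary values picked for illustration, and `iqr` is a helper written for this example using the quartile method from these notes.

```python
from statistics import mean, median, stdev

scores = [14, 19, 20, 20, 27, 28, 30, 32, 42, 44, 47, 53]

a, b = 10, 2                                  # hypothetical constants, b > 0
transformed = [a + b * x for x in scores]

def iqr(values):
    """Q3 - Q1, with quartiles taken as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])

# Each pair should match: (computed from transformed data, predicted by the table).
print("mean:     ", mean(transformed), a + b * mean(scores))
print("std. dev.:", stdev(transformed), b * stdev(scores))
print("median:   ", median(transformed), a + b * median(scores))
print("IQR:      ", iqr(transformed), b * iqr(scores))
```

Adding a shifts the center but not the spread, while multiplying by b stretches both.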
Comparing Data Sets
• The AP Exam always asks students to compare data.
• Clearly identify the populations that are being compared
• Make sure to compare each of CUSS
• Make reference to the measurement you are comparing
– i.e. use “mean” and not “center”
• Give the values of the measurements you are comparing.
• Make use of comparison phrases “is greater than” “is less than”
Assignment 1D
• P89 #39-41, 45-47