CHAPTERS 1 AND 2: DESCRIPTIVE STATISTICS STAT 241/251.
CHAPTERS 1 AND 2: DESCRIPTIVE STATISTICS
STAT 241/251
Outline
Data: definitions and examples
Tabular and Graphical Summaries: Categorical, Quantitative
Data Distributions
Numerical Summaries: Measures of Location, Measures of Spread
Boxplots
Data transformations
Definitions and Summaries
Data
Data – A Definition
A Need for Organization
Variables
Categorical:
Nominal: Ordinal:
Quantitative:
Discrete: Continuous:
More on Variables and Data
Why should we care about the type of variable?
Caution: Categorical variables are often recorded using numbers (e.g. yes=1, no=0). Don’t mistake these for quantitative variables.
Univariate Data: Data on one variable. E.g. weight or age or time or…
Multivariate Data: Data on multiple variables. E.g. weight and age and time and…
Raw Data
Four algorithms have been developed for cracking coded transmissions. Trials with each of the algorithms were run and the following data were collected:
Trial   Algorithm   Time to Completion (sec)   Success
1       3           1.34                       yes
2       1           3.45                       yes
3       4           0.99                       no
…       …           …                          …
In all, 2800 trials were done. Thus the complete table is 3 by 2800. This does not lend itself well to drawing conclusions.
Tables are Still Useful
Tables can be used to summarize data.
There are two basic types of summary tables:
Frequency Table
Relative Frequency Table
The definitions of each will change according to the type of data (Categorical or Quantitative).
For Categorical Data
A Frequency Table is a table that displays the total number of cases falling into each category of a single categorical variable.
A Relative Frequency Table displays the percentage/proportion of cases rather than the number of cases.
Success   Count
Yes       2520
No        280
Success   Percentage
Yes       90%
No        10%
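As a minimal sketch, both tables can be built from raw outcomes with Python's standard library. The `outcomes` list is a hypothetical stand-in for the Success column, constructed only to match the counts shown above, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical raw outcomes, built to match the counts on the slide
# (2520 successes and 280 failures out of 2800 trials).
outcomes = ["yes"] * 2520 + ["no"] * 280

def frequency_table(values):
    """Return {category: count} for a single categorical variable."""
    return dict(Counter(values))

def relative_frequency_table(values):
    """Return {category: percentage of cases} for a single categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}

freq = frequency_table(outcomes)
rel_freq = relative_frequency_table(outcomes)

# Print category, count and percentage side by side.
for category in freq:
    print(f"{category:<4}{freq[category]:>6}{rel_freq[category]:>7.1f}%")
```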
Steps to making a Frequency Table for Quantitative Data
1. Identify the smallest and largest observations to obtain the range of the values for the data.
2. Divide the (adjusted) range into equal sized non-overlapping bins.
3. Count the number of observations (frequency) in each bin.
4. Calculate the Relative Frequency for each bin. (optional)
Time to Completion
Time (in seconds)   Count
0 – 0.99            780
1.00 – 1.99         1276
2.00 – 2.99         614
3.00 – 3.99         130
- The bins should be of equal length
- It should be clear where each bin starts and ends.
- Use 5-20 bins
Displaying Categorical Data
Bar Charts
Pie Charts
Side-by-Side Bar Charts
Side-by-Side Pie Charts
Pictures Speak Louder than Tables
Pie Charts
Pie Charts present categories as slices of a circle, where the area of each slice is proportional to the number (or proportion) of cases in each category.
Caution with Fancy 3-D Plots
Displaying Quantitative Data
Histograms: The quantitative equivalent to the bar chart. It is a graphical version of the frequency or relative frequency table. One difference with the bar chart is that there is no space between the bars.
Stem and Leaf plots (not covered in this course)
Box-plots
Histogram Example
Consider the following ages: 18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71
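One way to follow the frequency-table steps for these ages is sketched below. The bin choice (width 10, starting at 10, seven bins) is an assumption made for illustration, and the text bars only approximate what a plotted histogram would show.

```python
ages = [18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71]

def bin_counts(values, start, width, n_bins):
    """Count observations in equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the bin that contains v
        if not 0 <= k < n_bins:
            raise ValueError(f"{v} falls outside the bins")
        counts[k] += 1
    return counts

# Assumed bins: 10-19, 20-29, ..., 70-79.
START, WIDTH, N_BINS = 10, 10, 7
counts = bin_counts(ages, START, WIDTH, N_BINS)

# One row per bin: the bar length equals the frequency.
for k, count in enumerate(counts):
    low = START + k * WIDTH
    print(f"{low}-{low + WIDTH - 1}  {'#' * count}  ({count})")
```

With these bins the tallest bar is the 30-39 group, so the sketch suggests a single hump with a longer tail toward the older ages.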
Distributions and Numerical Summaries
Data Distributions
The definition of data distribution changes for categorical variables and quantitative variables, but both have the same goal: to characterize the behavior of the variable.
Categorical Variable: Its distribution is the list of categories of the variable, along with the frequency of each.
Quantitative Variable: Listing every value with its frequency is usually not practical, so we need to describe other features.
Describing Distributions for Quantitative Data
Shape
Center
Spread
The disadvantage here is that we don’t have as good a grasp of the data as we do with categorical variables, but the advantage is that we can work with these numerically.
Shape part 1 - Modes
Does the distribution (viewed using a histogram, say) have no humps, one hump or more than one hump?
We call the humps modes (the most popular value(s) the variable can take on).
A distribution with no modes is called uniform.
A distribution with one mode is called unimodal.
A distribution with two modes is called bimodal.
A distribution with many modes is called multimodal.
Shape part 2 - Symmetry
If we cut the distribution at the center and find an approximate mirror image on both sides, the distribution is said to be symmetric.
The ends of the distribution are known as its tails.
Skewed
If the distribution is not symmetric, then we say that it is skewed.
We say it is skewed in the direction of the longer tail. Thus it can be left/negatively skewed or right/positively skewed.
Shape part 3 - Outliers
An Outlier is an observation that is quite far from the ‘body’ of the distribution.
They can cause problems with just about every method we will discuss in this course, so they must be identified.
In some cases, outliers are removed, but this must be done with great caution.
If an outlier is to be removed, it should be mentioned in any subsequent conclusion/discussion.
Numerical Summaries
The center and spread of the data are described numerically using summary statistics.
These try to communicate as much as possible about the data.
The shape of the distribution plays an important role in the choice of summary statistics.
Center 1 – Median
The median is the middle observation of the ordered list.
Calculating the median:
Order the data (usually from lowest to highest).
If there is an odd number of observations, select the middle observation.
If there is an even number of observations, take the average of the two middle observations.
E.g. if there are 7 observations, take the 4th; if there are 8 observations, take the average of observations 4 and 5.
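The steps above can be sketched as a small function. The two lists are illustrative values chosen only to show the odd and even cases; they do not come from the slides.

```python
def median(values):
    """Middle observation of the ordered list (average of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of the two middle ones

seven = [10, 20, 30, 40, 50, 60, 70]       # illustrative values
eight = [10, 20, 30, 40, 50, 60, 70, 80]   # illustrative values
print(median(seven))  # the 4th observation
print(median(eight))  # the average of the 4th and 5th observations
```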
Center 2 – Mean
The mean is simply the average of the observations.
We’ll use the mean extensively in this course.
Let y_i be the ith observation of variable y.
Let n be the number of observations. Then we denote the mean using $\bar{y}$ and calculate it using:
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$
More Mean
The mean summarizes the center well if the distribution is symmetric, unimodal and there are no outliers.
Otherwise the median is a better choice.
At this point, it may appear that the median is the natural choice to calculate the center, but the mean is often favoured. The reason is somewhat beyond our scope for the moment.
An exercise
Consider two small data sets.
1, 3, 7, 2, 12, 4, 8, 5, 8
1, 3, 7, 2, 12, 4, 8, 5, 8, 120
Calculate the mean and median
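The exercise can be checked with Python's `statistics` module, as in the sketch below. The variable names are illustrative.

```python
from statistics import mean, median

without_outlier = [1, 3, 7, 2, 12, 4, 8, 5, 8]
with_outlier = without_outlier + [120]   # the second data set adds the value 120

for label, data in [("without 120", without_outlier), ("with 120", with_outlier)]:
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data)}")
```

Adding the single value 120 moves the mean from about 5.56 to 17, while the median only moves from 5 to 6, which illustrates why the median is preferred when outliers are present.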
Spread 1 – Interquartile Range
The IQR is the range of the middle 50% of the data.
The first quartile (Q1) is the value with 25% of the ordered data below it.
The third quartile (Q3) is the value with 75% of the ordered data below it.
IQR = Q3 - Q1. It is a number, not an interval.
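A minimal sketch of the calculation follows. Several quartile conventions exist; the one assumed here takes Q1 and Q3 as the medians of the lower and upper halves of the ordered data, leaving the middle observation out of both halves when n is odd. This choice is an assumption, although it does reproduce the Q1 = 10.4 and Q3 = 12.9 quoted for the pacemaker data in these slides.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the ordered data.

    Assumed convention: when n is odd, the middle observation is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]        # smallest half of the data
    upper = ordered[n - half:]    # largest half of the data
    return median(lower), median(upper)

def iqr(values):
    """Interquartile range: a single number, Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

data = [1, 3, 7, 2, 12, 4, 8, 5, 8]
print(quartiles(data), iqr(data))
print(quartiles(data + [120]), iqr(data + [120]))
```

Under this convention the IQR changes only from 5.5 to 5 when the value 120 is added, so it is barely affected by the outlier.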
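Spread 2 – Variance and SD
The Variance is the ‘average’ of the squared differences (or deviations) from the mean. With $\bar{y}$ the mean and n the number of observations:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$$
The Standard Deviation is the square root of the variance: $s = \sqrt{s^2}$.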
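As a sketch, the variance and SD can be computed directly from their definitions and compared with `statistics.variance` and `statistics.stdev`, which also divide by n - 1. The data values are illustrative and do not come from the slides.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values
print(sample_variance(data), variance(data))
print(sample_sd(data), stdev(data))
```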
Properties of SD and Variance
They cannot be negative.
The SD has the same units as the observations/mean.
They are zero iff all observations have the same value. (i.e. there is no spread!)
The larger they are the more spread out the data are.
So why divide by n-1? Why not just take the average?
What we should retain
The median and IQR are good measures of center and spread even when the distribution is skewed or has outliers.
The mean and variance are good measures of center and spread when the distribution is symmetric, has no outliers, and is not multimodal.
When the data are symmetric, without outliers, and not multimodal, we typically report only the mean and SD.
If the distribution is multimodal, then summary statistics are not appropriate.
Boxplots
Box-plot
Q3 – 3rd Quartile
Median
Q1 – First Quartile
Upper and Lower Whiskers
Box-plot
The Box-plot is a visual display of the 5-number summary.
It is useful for comparing two or more distributions.
Constructing a Boxplot
Step 1 – the Box: Identify the Median, 1st and 3rd Quartiles and complete the box.
Step 2 – the Fences: Fences are only used for construction purposes. The lower fence is 1.5 × IQR below Q1 and the upper fence is 1.5 × IQR above Q3.
Step 3 – the Whiskers: Extend a line from each quartile to the most extreme observation within the fences. At these points, extend the whiskers.
Step 4 – Outliers: Any point outside the fences should be drawn in as points. These are potential outliers.
Lifetime of Pacemakers
Replacing a pacemaker is a big deal. Data were collected on pacemaker lifetimes (in years). Here are the raw data:
12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4, 11.1, 15.5, 12.4, 10.4, 10.7, 10.2
Summary Statistics: Median = 11.6, Q1 = 10.4, Q3 = 12.9
Draw an appropriate boxplot.
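The construction steps can be sketched numerically for this data set. The function name and the returned dictionary are illustrative, and Q1, the median and Q3 are taken as given on the slide rather than recomputed.

```python
lifetimes = [12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4, 11.1, 15.5, 12.4, 10.4, 10.7, 10.2]

# Summary statistics as given on the slide.
Q1, MEDIAN, Q3 = 10.4, 11.6, 12.9

def boxplot_parts(values, q1, q3):
    """Return the fences, whisker ends and potential outliers for a boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme observations that are still within the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Anything outside the fences is drawn as a point (a potential outlier).
    outside = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outside,
    }

parts = boxplot_parts(lifetimes, Q1, Q3)
print(f"box: Q1 = {Q1}, median = {MEDIAN}, Q3 = {Q3}")
print(f"fences: {parts['fences'][0]:.2f} to {parts['fences'][1]:.2f}")
print(f"whiskers end at {parts['whiskers'][0]} and {parts['whiskers'][1]}")
print(f"potential outliers: {parts['outliers']}")
```

With IQR = 2.5 the fences fall at about 6.65 and 16.65, so the whiskers end at 9.2 and 15.5, and the lifetimes 1.2 and 20.4 are drawn as potential outliers.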
Boxplot Questions
When are box-plots inappropriate?
When do we favor box-plots over histograms?
Explain why a point identified as an outlier by a box-plot may not be an outlier.
Linear Transformations!
Purpose
There are many situations which lead to the need to linearly transform data. E.g. Your firm sends some temperature readings to an American firm, so it transforms the readings in degrees Celsius to degrees Fahrenheit.
When we transform the data, what happens to summary statistics?
What are you measuring?
How a numerical measure is affected depends on what it is measuring: location or spread.
We create three classes of numerical summaries, which are affected differently by transformations:
Measures of Location: Mean, Median, Midrange, Quartiles, Percentiles, Min and Max
Measures of Spread: Standard Deviation, IQR, Range
Variance
Measures of Location
These are affected by both adding/subtracting and multiplying/dividing.
Let m be the current measure of location, then
Adding the constant a will result in the new measure: m’ = m + a
Multiplying by b will lead to: m’ = bm
Using the linear function f(x)=a+bx will lead to: m’ = a + bm
Measures of Spread
These are affected by multiplying/dividing, but not by adding/subtracting.
Let m be the current measure of spread, then
Adding the constant a will result in the new measure: m’ = m
Multiplying by b will lead to: m’ = |b|m (which is simply bm when b is positive)
Using the linear function f(x)=a+bx will lead to: m’ = |b|m
Variance
Same as measures of spread, but the effect is different.
Let v be the variance, then
Adding the constant a will result in the new measure: v’ = v
Multiplying by b will lead to: v’ = b²v
Using the linear function f(x)=a+bx will lead to: v’ = b²v
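These rules can be checked numerically with the Celsius to Fahrenheit conversion, f(x) = 32 + 1.8x. The sketch below uses illustrative temperature readings that do not come from the slides, and `close` is a small helper introduced here to compare floating-point results.

```python
from statistics import mean, median, stdev, variance

celsius = [18.5, 21.0, 19.2, 25.4, 22.8, 20.1]  # illustrative readings
a, b = 32, 1.8                                   # f(x) = a + b*x converts Celsius to Fahrenheit
fahrenheit = [a + b * x for x in celsius]

def close(u, v, tol=1e-9):
    """True when two floating-point results agree up to rounding error."""
    return abs(u - v) < tol

# Measures of location: m' = a + b*m
print(close(mean(fahrenheit), a + b * mean(celsius)))
print(close(median(fahrenheit), a + b * median(celsius)))
# Measures of spread: m' = b*m for positive b (adding a has no effect)
print(close(stdev(fahrenheit), b * stdev(celsius)))
# Variance: v' = b^2 * v
print(close(variance(fahrenheit), b ** 2 * variance(celsius)))
```

Each line prints `True`, which agrees with the three sets of rules for location, spread and variance.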
Questions
The average weight of watermelons from a farm is 4.3 kg with an SD of 1.5 kg. What are the mean and SD of the weights in lbs?
The median and IQR on a final exam are 50 and 22. The instructor decides to multiply the results by 1.13 and add 5 to each grade. What are the summary statistics now?