CHAPTERS 1 AND 2: DESCRIPTIVE STATISTICS STAT 241/251.


Outline

- Data: definitions and examples
- Tabular and Graphical Summaries
  - Categorical
  - Quantitative
- Data Distributions
- Numerical Summaries
  - Measures of Location
  - Measures of Spread
- Boxplots
- Data transformations

Definitions and Summaries

Data

Data – A Definition

A Need for Organization

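Variables

Categorical: values are labels or categories.
- Nominal: categories with no natural order (e.g. which algorithm was used).
- Ordinal: categories with a natural order (e.g. low, medium, high).

Quantitative: values are numbers that measure an amount.
- Discrete: values that can be counted or listed (e.g. number of trials).
- Continuous: values that can fall anywhere in an interval (e.g. time to completion).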

More on Variables and Data


Why should we care about the type of variable?

Caution: Categorical variables are often recorded using numbers (e.g. yes=1, no=0). Don’t mistake these for quantitative variables.

Univariate Data: Data on one variable. E.g. weight or age or time or…

Multivariate Data: Data on multiple variables. E.g. weight and age and time and…

Raw Data

Four algorithms have been developed for cracking coded transmissions. Trials with each of the algorithms were run and the following data were collected:

Trial   Algorithm   Time to Completion (sec)   Success
1       3           1.34                       yes
2       1           3.45                       yes
3       4           0.99                       no
…       …           …                          …

In all, 2800 trials were done. Thus the complete table has 2800 rows, one per trial, each recording 3 variables. A table of this size does not lend itself well to drawing conclusions.

Tables are Still Useful

Tables can be used to summarize data.

There are two basic types of summary tables:
- Frequency Table
- Relative Frequency Table

The definitions of each will change according to the type of data (Categorical or Quantitative).

For Categorical Data

A Frequency Table is a table that displays the total number of cases falling into each category of a single categorical variable.

A Relative Frequency Table displays the percentage/proportion of cases rather than the number of cases.

Success   Count
Yes       2520
No        280

Success   Percentage
Yes       90%
No        10%
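As a minimal sketch in Python (the language and the variable names are illustrative choices, not part of the course material), the relative frequency table can be derived from the counts in the frequency table:

```python
# Counts taken from the frequency table for the Success variable.
counts = {"Yes": 2520, "No": 280}

total = sum(counts.values())  # 2800 trials in all

# Relative frequency = count in the category / total number of cases.
relative = {category: count / total for category, count in counts.items()}

for category, share in relative.items():
    print(f"{category:<4}{share:.0%}")
```

The proportions always sum to 1, which is a quick check that no category was missed.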

Steps to making a Frequency Table for Quantitative Data

1. Identify the smallest and largest observations to obtain the range of the values for the data.

2. Divide the (adjusted) range into equal-sized, non-overlapping bins.

3. Count the number of observations (frequency) in each bin.

4. Calculate the Relative Frequency for each bin. (optional)

Time to Completion

Time (in seconds)   Count
0 – 0.99            780
1.00 – 1.99         1276
2.00 – 2.99         614
3.00 – 3.99         130

- The bins should be of equal length.
- It should be clear where each bin starts and ends.
- Use 5-20 bins.

Displaying Categorical Data


Bar Charts

Pie Charts

Side-by-Side Bar Charts

Side-by-Side Pie Charts


Pictures Speak Louder than Tables

Pie Charts

Pie Charts present categories as slices of a circle, where the area of each slice is proportional to the number (or proportion) of cases in each category.


Caution with Fancy 3-D Plots


Displaying Quantitative Data


Histograms: The quantitative equivalent of the bar chart. It is a graphical version of the frequency or relative frequency table. One difference from the bar chart is that there is no space between the bars.

Stem and Leaf plots (not covered in this course)

Box-plots

Histogram Example

Consider the following ages: 18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71
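A possible way to build the histogram for these ages, following the four steps for a quantitative frequency table, is sketched below in Python. The bin width of 10 years and all variable names are assumptions made for illustration; other bin choices are equally valid.

```python
ages = [18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71]

# Step 1: smallest and largest observations.
low, high = min(ages), max(ages)  # 18 and 71

# Step 2: adjust the range to 10-80 and cut it into equal-width bins
# [10, 20), [20, 30), ..., [70, 80). The width of 10 is an assumed choice.
width = 10
start = (low // width) * width
stop = (high // width + 1) * width
edges = list(range(start, stop + width, width))

# Step 3: count the observations falling in each bin.
counts = [sum(lo <= a < lo + width for a in ages) for lo in edges[:-1]]

# Step 4 (optional): relative frequency of each bin.
rel = [c / len(ages) for c in counts]

# Text histogram: one '*' per observation.
for lo, c in zip(edges[:-1], counts):
    print(f"{lo}-{lo + width - 1}: {'*' * c}")
```

With this choice there are 7 bins, which falls inside the suggested 5-20 range.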


Distributions and Numerical Summaries

Data Distributions

The definition of a data distribution differs for categorical and quantitative variables, but both have the same goal: to characterize the behavior of the variable.

Categorical Variable: Its distribution is the list of categories of the variable, along with the frequency of each.

Quantitative Variable: Listing every value with its frequency is usually not practical, because there can be too many distinct values, so we need to describe other features.

Describing Distributions for Quantitative Data

Shape

Center

Spread

The disadvantage here is that we don’t have as good a grasp of the data as we do with categorical variables, but the advantage is that we can work with these numerically.


Shape part 1 - Modes

Does the distribution (viewed using a histogram, say) have no humps, one hump or more than one hump?

We call the humps modes (the most popular value(s) the variable can take on).

A distribution with no modes is called uniform.

A distribution with one mode is called unimodal.

A distribution with two modes is called bimodal.

A distribution with many modes is called multimodal.


Shape part 2 - Symmetry

If we cut the distribution at the center and find an approximate mirror image on both sides, the distribution is said to be symmetric.

The ends of the distribution are known as its tails.


Skewed

If the distribution is not symmetric, then we say that it is skewed.

We say it is skewed in the direction of the longer tail. Thus it can be left/negatively skewed or right/positively skewed.


Shape part 3 - Outliers

An Outlier is an observation that is quite far from the ‘body’ of the distribution.

They can cause problems with just about every method we will discuss in this course, so they must be identified.

In some cases, outliers are removed, but this must be done with great caution.

If an outlier is to be removed, it should be mentioned in any subsequent conclusion/discussion.

Numerical Summaries

The center and spread of the data are described numerically using summary statistics.

These try to communicate as much as possible about the data.

The shape of the distribution plays an important role in the choice of summary statistics.


Center 1 – Median

The median is the middle observation of the ordered list.

Calculating the median:
- Order the data (usually from lowest to highest).
- If there is an odd number of observations, select the middle observation.
- If there is an even number of observations, take the average of the two middle observations.

E.g. if there are 7 observations, take the 4th; if there are 8 observations, take the average of observations 4 and 5.
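The calculation can be sketched in Python as follows. The function name and the sample lists are illustrative assumptions; Python's standard library offers `statistics.median`, which applies the same rule.

```python
def median(values):
    """Return the middle observation of the ordered list."""
    ordered = sorted(values)          # order the data from lowest to highest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle observation
    # even n: average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


seven = [3, 1, 2, 7, 5, 6, 4]         # 7 observations: take the 4th
eight = [8, 3, 1, 2, 7, 5, 6, 4]      # 8 observations: average the 4th and 5th
print(median(seven), median(eight))
```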


Center 2 – Mean

The mean is simply the average of the observations.

We’ll use the mean extensively in this course.

Let $y_i$ be the ith observation of variable y.

Let n be the number of observations. Then we denote the mean using $\bar{y}$ and calculate it using:

$$ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i $$


More Mean

The mean summarizes the center well if the distribution is symmetric, unimodal and there are no outliers.

Otherwise the median is a better choice.

At this point, it may appear that the median is the natural choice to calculate the center, but the mean is often favoured. The reason is somewhat beyond our scope for the moment.

An exercise

Consider two small data sets.

1, 3, 7, 2, 12, 4, 8, 5, 8

1, 3, 7, 2, 12, 4, 8, 5, 8, 120

Calculate the mean and median of each data set.
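One way to check the hand calculation is the short Python sketch below, which uses the standard `statistics` module; the variable names are illustrative assumptions.

```python
from statistics import mean, median

without_outlier = [1, 3, 7, 2, 12, 4, 8, 5, 8]
with_outlier = without_outlier + [120]   # same data plus one extreme value

for label, data in (("first", without_outlier), ("second", with_outlier)):
    print(f"{label} data set: mean = {mean(data):.2f}, median = {median(data)}")
```

The single value 120 moves the mean from about 5.56 to 17, while the median only moves from 5 to 6, which illustrates why the median is preferred when outliers are present.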

Spread 1 – Interquartile Range

The IQR is the range of the middle 50% of the data.

The first quartile (Q1) is the value with 25% of the ordered data below it.

The third quartile (Q3) is the value with 75% of the ordered values below it.

IQR = Q3 - Q1. It is a number, not an interval.
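A rough Python sketch of the IQR calculation follows. Several conventions exist for computing quartiles; this one assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data (leaving out the middle observation when n is odd), so other software may report slightly different values. The function name and the reuse of the first exercise data set are illustrative choices.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (needs n >= 2)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]       # observations below the median position
    upper = ordered[-half:]      # observations above the median position
    return median(lower), median(upper)


data = [1, 3, 7, 2, 12, 4, 8, 5, 8]
q1, q3 = quartiles(data)
iqr = q3 - q1                    # a single number, not an interval
print(q1, q3, iqr)
```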

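Spread 2 – Variance and SD

The Variance is the ‘average’ of the squared differences (or deviations) from the mean. With n observations $y_1, \dots, y_n$ and mean $\bar{y}$, and dividing by n-1 rather than n, it is written here as $s^2$:

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

The Standard Deviation is the square root of the variance:

$$ s = \sqrt{s^2} $$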
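The definitions can be sketched in Python as below, assuming the sample version that divides by n-1. The function names and the four data values are made up for illustration.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))


data = [2, 4, 6, 8]              # hypothetical observations, mean = 5
var = sample_variance(data)      # (9 + 1 + 1 + 9) / 3
sd = sample_sd(data)
print(f"variance = {var:.3f}, SD = {sd:.3f}")
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n-1 divisor.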

Properties of SD and Variance

- They cannot be negative.
- The SD has the same units as the observations/mean.
- They are zero iff all observations have the same value (i.e. there is no spread!).
- The larger they are, the more spread out the data are.

So why divide by n-1? Why not just take the average?


What we should retain

The median and IQR are good measures of center and spread even when the distribution is skewed or has outliers.

The mean and variance are good measures of center and spread when the distribution is symmetric without outliers, but not multimodal.

When the distribution is symmetric, without outliers, and not multimodal, we typically only report the mean and SD.

If the distribution is multimodal, then summary statistics are not appropriate.

Boxplots

Box-plot

[Figure: annotated box-plot showing Q3 (third quartile), the median, Q1 (first quartile), and the upper and lower whiskers.]

Box-plot

The Box-plot is a visual display of the 5-number summary.

It is useful for comparing two or more distributions.

Constructing a Boxplot

Step 1 – the Box: Identify the Median, 1st and 3rd Quartiles and complete the box.

Step 2 – the Fences: Fences are only used for construction purposes. The lower fence is 1.5 × IQR below Q1 and the upper fence is 1.5 × IQR above Q3.

Step 3 – the Whiskers: Extend a line from each quartile to the most extreme observation within the fences. The whiskers end at these observations.

Step 4 – Outliers: Any observations outside the fences should be drawn as individual points. These are potential outliers.

Lifetime of Pacemakers

Replacing a pacemaker is a big deal. Data were collected on pacemaker lifetimes (in years). Here are the raw data:

12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4, 11.1, 15.5, 12.4, 10.4, 10.7, 10.2

Summary Statistics: Median = 11.6, Q1 = 10.4, Q3 = 12.9

Draw an appropriate boxplot.
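Before drawing, the fences, whiskers and potential outliers can be worked out numerically. The Python sketch below applies the construction steps to the pacemaker data, taking the quartiles as given in the summary statistics; the variable names are illustrative assumptions.

```python
lifetimes = [12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4,
             11.1, 15.5, 12.4, 10.4, 10.7, 10.2]
q1, q3 = 10.4, 12.9                  # quartiles from the summary statistics

# Fences: 1.5 x IQR beyond each quartile (construction aid only).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr         # about 6.65
upper_fence = q3 + 1.5 * iqr         # about 16.65

# Whiskers: most extreme observations that are still within the fences.
inside = [y for y in lifetimes if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Potential outliers: observations outside the fences.
outliers = sorted(y for y in lifetimes if y < lower_fence or y > upper_fence)

print(f"whiskers end at {whisker_low} and {whisker_high}; outliers: {outliers}")
```

Under these assumptions the whiskers end at 9.2 and 15.5 years, and the lifetimes 1.2 and 20.4 are plotted as individual points.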

Boxplot Questions

When are box-plots inappropriate?

When do we favor box-plots over histograms?

Explain why a point identified as an outlier by a box-plot may not be an outlier.


Linear Transformations!

Purpose

There are many situations which lead to the need to linearly transform data. E.g. Your firm sends some temperature readings to an American firm, so it transforms the readings in degrees Celsius to degrees Fahrenheit.

When we transform the data, what happens to summary statistics?

What are you measuring?

How a numerical measure is affected depends on what it measures: location or spread.

We create three classes of numerical summaries, which are affected differently by transformations:
- Measures of Location: Mean, Median, Midrange, Quartiles, Percentiles, Min and Max
- Measures of Spread: Standard Deviation, IQR, Range
- Variance

Measures of Location

These are affected by both adding/subtracting and multiplying/dividing.

Let m be the current measure of location. Then:
- Adding the constant a will result in the new measure: m’ = m + a
- Multiplying by b will lead to: m’ = bm
- Using the linear function f(x) = a + bx will lead to: m’ = a + bm
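The rule m’ = a + bm can be checked numerically. The Python sketch below uses the Celsius-to-Fahrenheit conversion f(x) = 32 + 1.8x; the temperature readings are made up for illustration and the variable names are assumptions.

```python
from statistics import mean, median

# Hypothetical temperature readings in degrees Celsius (illustrative only).
celsius = [10, 15, 20, 25, 40]

a, b = 32, 1.8                        # f(x) = a + bx: Celsius to Fahrenheit
fahrenheit = [a + b * x for x in celsius]

for name, stat in (("mean", mean), ("median", median)):
    m = stat(celsius)                 # measure of location before transforming
    print(f"{name}: from transformed data {stat(fahrenheit):.1f}, "
          f"from a + b*m {a + b * m:.1f}")
```

Both routes give the same value, so the summary can be converted directly without transforming every observation.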

Measures of Spread

These are affected by multiplying/dividing, but not by adding/subtracting.

Let m be the current measure of spread. Then:
- Adding the constant a will result in the new measure: m’ = m
- Multiplying by b will lead to: m’ = |b|m (which is bm when b is positive)
- Using the linear function f(x) = a + bx will lead to: m’ = |b|m

Variance

Like the measures of spread, the variance is not affected by adding/subtracting, but multiplying/dividing changes it by the square of the constant.

Let v be the variance. Then:
- Adding the constant a will result in the new measure: v’ = v
- Multiplying by b will lead to: v’ = b²v
- Using the linear function f(x) = a + bx will lead to: v’ = b²v
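The different effects on the SD and the variance can be seen in a small Python check, again using the conversion f(x) = 32 + 1.8x. The readings are made up for illustration and the variable names are assumptions.

```python
from statistics import stdev, variance

# Hypothetical temperature readings in degrees Celsius (illustrative only).
celsius = [10, 15, 20, 25, 40]

b = 1.8
fahrenheit = [32 + b * x for x in celsius]   # adding 32 does not change spread

sd_c, sd_f = stdev(celsius), stdev(fahrenheit)
var_c, var_f = variance(celsius), variance(fahrenheit)

print(f"SD:       {sd_f:.3f} vs b * SD        = {b * sd_c:.3f}")
print(f"variance: {var_f:.3f} vs b^2 * variance = {b ** 2 * var_c:.3f}")
```

The SD is multiplied by 1.8, while the variance is multiplied by 1.8² = 3.24.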

Questions

The average weight of watermelons from a farm is 4.3 kg with a SD of 1.5 kg. What are the mean and SD of the weights in lbs?

The median and IQR on a final exam are 50 and 22. The instructor decides to multiply the results by 1.13 and add 5 to each grade. What are the summary statistics now?
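A possible way to check answers to these questions is sketched below in Python. It assumes a conversion factor of about 2.20462 lb per kg, which is not stated in the slides, and the variable names are illustrative.

```python
KG_TO_LB = 2.20462                   # assumed conversion factor (lb per kg)

# Question 1: multiplying by a constant scales both the mean and the SD.
mean_lb = 4.3 * KG_TO_LB
sd_lb = 1.5 * KG_TO_LB

# Question 2: new grade = 5 + 1.13 * old grade.
new_median = 5 + 1.13 * 50           # location: affected by both a and b
new_iqr = 1.13 * 22                  # spread: affected by b only

print(f"mean = {mean_lb:.2f} lb, SD = {sd_lb:.2f} lb")
print(f"median = {new_median:.2f}, IQR = {new_iqr:.2f}")
```

Under these assumptions the watermelons weigh about 9.48 lb on average with an SD of about 3.31 lb, and the adjusted exam has a median of 61.5 and an IQR of 24.86.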