CHAPTERS 1 AND 2: DESCRIPTIVE STATISTICS STAT 241/251.

Outline

Data: definitions and examples

Tabular and Graphical Summaries: Categorical, Quantitative

Data Distributions

Numerical Summaries: Measures of Location, Measures of Spread

Boxplots

Data transformations

Definitions and Summaries

Data

Data – A Definition

A Need for Organization

Variables

Categorical:

Nominal: Ordinal:

Quantitative:

Discrete: Continuous:

More on Variables and Data

Why should we care about the type of variable?

Caution: Categorical variables are often recorded using numbers (e.g. yes=1, no=0). Don’t mistake these for quantitative variables.

Univariate Data: Data on one variable. E.g. weight or age or time or…

Multivariate Data: Data on multiple variables. E.g. weight and age and time and…

Raw Data

Four algorithms have been developed for cracking coded transmissions. Trials with each of the algorithms were run and the following data were collected:

Trial   Algorithm   Time to Completion (sec)   Success
1       3           1.34                       yes
2       1           3.45                       yes
3       4           0.99                       no
…       …           …                          …

In all, 2800 trials were run, so the complete table has 2800 rows and three recorded variables. A table that size does not lend itself well to drawing conclusions.

Tables are Still Useful

Tables can be used to summarize data.

There are two basic types of summary tables:

Frequency Table

Relative Frequency Table

The definitions of each will change according to the type of data (Categorical or Quantitative).

For Categorical Data

A Frequency Table is a table that displays the total number of cases falling into each category of a single categorical variable.

A Relative Frequency Table displays the percentage or proportion of cases rather than the number of cases.

Success   Count
Yes       2520
No        280

Success   Percentage
Yes       90%
No        10%
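
As a rough illustration, both tables above can be produced from the raw Success column. The sketch below rebuilds that column from the published counts (the per-trial values are not listed here), and the variable names are illustrative only.

```python
from collections import Counter

# Rebuild the Success column from the counts in the tables above
# (2520 "Yes" and 280 "No"); the raw per-trial values are not listed.
success = ["Yes"] * 2520 + ["No"] * 280

# Frequency table: number of cases in each category.
frequency = Counter(success)

# Relative frequency table: proportion of cases in each category.
n = len(success)
relative_frequency = {category: count / n for category, count in frequency.items()}

for category in frequency:
    print(f"{category:<4}{frequency[category]:>6}{relative_frequency[category]:>8.0%}")
```

Dividing each count by the total number of cases is all that separates the two tables.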

Steps to making a Frequency Table for Quantitative Data

1. Identify the smallest and largest observations to obtain the range of the values for the data.

2. Divide the (adjusted) range into equal sized non-overlapping bins.

3. Count the number of observations (frequency) in each bin.

4. Calculate the Relative Frequency for each bin. (optional)

Time to Completion

Time (in seconds)   Count
0 – 0.99            780
1.00 – 1.99         1276
2.00 – 2.99         614
3.00 – 3.99         130

- The bins should be of equal length.
- It should be clear where each bin starts and ends.
- Use 5-20 bins.
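
The optional fourth step, the relative frequency for each bin, can be sketched from the counts in the Time to Completion table. The dictionary layout and names below are assumptions made for illustration.

```python
# Bin counts copied from the Time to Completion table.
bin_counts = {
    "0 - 0.99": 780,
    "1.00 - 1.99": 1276,
    "2.00 - 2.99": 614,
    "3.00 - 3.99": 130,
}

total = sum(bin_counts.values())  # 2800 trials in all

# Step 4: relative frequency = bin count / total count.
relative_frequency = {label: count / total for label, count in bin_counts.items()}

for label, count in bin_counts.items():
    print(f"{label:<12}{count:>6}{relative_frequency[label]:>8.1%}")
```

The counts sum to the 2800 trials, so the relative frequencies sum to 1.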

Displaying Categorical Data

Bar Charts

Pie Charts

Side-by-Side Bar Charts

Side-by-Side Pie Charts

Pictures Speak Louder than Tables

Pie Charts

Pie Charts present categories as slices of a circle, where the area of each slice is proportional to the number (or proportion) of cases in each category.

Caution with Fancy 3-D Plots

Displaying Quantitative Data

Histograms: The quantitative equivalent to the bar chart. It is a graphical version of the frequency or relative frequency table. One difference with the bar chart is that there is no space between the bars.

Stem and Leaf plots (not covered in this course)

Box-plots

Histogram Example

Consider the following ages: 18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71
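
One possible way to turn these ages into a histogram is to follow the steps for a quantitative frequency table. The bin width of 10 and the starting point of 10 are assumptions chosen for this sketch, not values given in the notes.

```python
ages = [18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71]

# Assumed bins: width 10, starting at 10, so each bin covers [start, start + 10).
bin_width = 10
bin_start = 10
n_bins = 7  # covers 10 up to (but not including) 80

counts = [0] * n_bins
for age in ages:
    index = (age - bin_start) // bin_width  # which bin this age falls into
    counts[index] += 1

# Text histogram: one '*' per observation, with no gaps between adjacent bins.
for i, count in enumerate(counts):
    low = bin_start + i * bin_width
    print(f"{low}-{low + bin_width - 1}: {'*' * count}")
```

With these bins the tallest bar is 30-39, which holds five of the 17 ages.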

Distributions and Numerical Summaries

Data Distributions

The definition of a data distribution differs for categorical and quantitative variables, but the goal is the same: to characterize the behavior of the variable.

Categorical Variable: Its distribution is the list of categories of the variable, along with the frequency of each.

Quantitative Variable: A quantitative variable can take too many distinct values to list each with its frequency, so we need to describe other features of the distribution.

Describing Distributions for Quantitative Data

Shape

Center

Spread

The disadvantage here is that we don’t have as good a grasp of the data as we do with categorical variables, but the advantage is that we can work with these numerically.

Shape part 1 - Modes

Does the distribution (viewed using a histogram, say) have no humps, one hump or more than one hump?

We call the humps modes (the most popular value(s) the variable can take on).

A distribution with no modes is called uniform.

A distribution with one mode is called unimodal.

A distribution with two modes is called bimodal.

A distribution with many modes is called multimodal.

Shape part 2 - Symmetry

If we cut the distribution at the center and find approximate mirror images on the two sides, the distribution is said to be symmetric.

The ends of the distribution are known as its tails.

Skewed

If the distribution is not symmetric, then we say that it is skewed.

We say it is skewed in the direction of the longer tail. Thus it can be left/negatively skewed or right/positively skewed.

Shape part 3 - Outliers

An Outlier is an observation that is quite far from the ‘body’ of the distribution.

They can cause problems with just about every method we will discuss in this course, so they must be identified.

In some cases, outliers are removed, but this must be done with great caution.

If an outlier is to be removed, it should be mentioned in any subsequent conclusion/discussion.

Numerical Summaries

The center and spread of the data are described numerically using summary statistics.

These aim to communicate as much as possible about the data.

The shape of the distribution plays an important role in the choice of summary statistics.

Center 1 – Median

The median is the middle observation of the ordered list.

Calculating the median:

Order the data (usually from lowest to highest).

If there is an odd number of observations, select the middle observation.

If there is an even number of observations, take the average of the two middle observations.

E.g. if there are 7 observations, take the 4th; if there are 8 observations, take the average of observations 4 and 5.
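
The procedure above can be sketched as a small function. The two data sets are hypothetical and were chosen only to exercise the odd and even cases.

```python
def median(values):
    """Return the median by ordering the data and picking the middle."""
    ordered = sorted(values)          # order from lowest to highest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd n: the single middle observation
    # even n: average of the two middle observations
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical data, chosen only to show the odd and even cases.
seven = [9, 2, 7, 4, 5, 1, 8]        # ordered: 1 2 4 5 7 8 9 -> 4th is 5
eight = [9, 2, 7, 4, 5, 1, 8, 10]    # ordered: 1 2 4 5 7 8 9 10 -> (5 + 7) / 2

print(median(seven))  # 5
print(median(eight))  # 6.0
```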

Center 2 – Mean

The mean is simply the average of the observations.

We’ll use the mean extensively in this course.

Let y_i be the ith observation of variable y.

Let n be the number of observations. Then we denote the mean by $\bar{y}$ and calculate it using:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

More Mean

The mean summarizes the center well if the distribution is symmetric, unimodal and there are no outliers.

Otherwise the median is a better choice.

At this point, it may appear that the median is the natural choice to calculate the center, but the mean is often favoured. The reason is somewhat beyond our scope for the moment.

An exercise

Consider two small data sets.

1, 3, 7, 2, 12, 4, 8, 5, 8

1, 3, 7, 2, 12, 4, 8, 5, 8, 120

Calculate the mean and median of each data set.
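
One way to check a hand calculation is with Python's standard `statistics` module. This is only a sketch, and the names `first` and `second` are illustrative.

```python
from statistics import mean, median

first = [1, 3, 7, 2, 12, 4, 8, 5, 8]
second = first + [120]  # same data plus one very large observation

for label, data in (("first", first), ("second", second)):
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data)}")
```

The single extra observation moves the mean from about 5.56 to 17, while the median only moves from 5 to 6.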

Spread 1 – Interquartile Range

The IQR is the range of the middle 50% of the data.

The first quartile (Q1) is the value with 25% of the ordered data below it.

The third quartile (Q3) is the value with 75% of the ordered values below it.

IQR = Q3 - Q1. It is a number, not an interval.

Spread 2 – Variance and SD

The Variance is the ‘average’ of the squared differences (or deviations) from the mean.

The Standard Deviation is the square root of the variance.
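
A minimal sketch of both definitions follows, using the first data set from the exercise. It assumes the sample convention of dividing the sum of squared deviations by n - 1 rather than n, which is the convention the notes allude to when they ask why we divide by n-1.

```python
from math import sqrt

# First data set from the exercise earlier in these notes.
data = [1, 3, 7, 2, 12, 4, 8, 5, 8]

n = len(data)
mean = sum(data) / n

# Squared deviations from the mean.
squared_deviations = [(y - mean) ** 2 for y in data]

# Sample variance: sum of squared deviations divided by n - 1 (assumed convention).
variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance, in the units of the data.
sd = sqrt(variance)

print(f"mean = {mean:.3f}, variance = {variance:.3f}, SD = {sd:.3f}")
```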

Properties of SD and Variance

They cannot be negative.

The SD has the same units as the observations/mean.

They are zero iff all observations have the same value. (i.e. there is no spread!)

The larger they are, the more spread out the data are.

So why divide by n-1? Why not just take the average?

What we should retain

The median and IQR are good measures of center and spread even when the distribution is skewed or has outliers.

The mean and variance are good measures of center and spread when the distribution is symmetric, has no outliers, and is not multimodal.

When the data are symmetric, have no outliers, and are not multimodal, we typically only report the mean and SD.

If the distribution is multimodal, then summary statistics are not appropriate.

Boxplots

Box-plot

[Figure: a box-plot with labels for Q3 (third quartile), the median, Q1 (first quartile), and the upper and lower whiskers]

Box-plot

The Box-plot is a visual display of the 5-number summary.

It is useful for comparing two or more distributions.

Constructing a Boxplot

Step 1 – the Box: Identify the Median, 1st and 3rd Quartiles and complete the box.

Step 2 – the Fences: Fences are only used for construction purposes. The fences are 1.5 × IQR away from each Quartile.

Step 3 – the Whiskers: Extend a line from each quartile to the most extreme observation within the fences. The whiskers end at these points.

Step 4 – Outliers: Any observation outside the fences should be drawn as an individual point. These are potential outliers.

Lifetime of Pacemakers

Replacing a pacemaker is a big deal. Data were collected on pacemaker lifetimes (in years). Here are the raw data:

12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4, 11.1, 15.5, 12.4, 10.4, 10.7, 10.2

Summary Statistics: Median = 11.6, Q1 = 10.4, Q3 = 12.9

Draw an appropriate boxplot.
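
Before drawing, the numbers the four construction steps call for can be sketched as follows. The quartile rule used here (the median of each half of the ordered data) is an assumption. It was chosen because it reproduces the summary statistics given for this data set, and other software may use slightly different rules.

```python
lifetimes = [12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4,
             11.1, 15.5, 12.4, 10.4, 10.7, 10.2]


def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


ordered = sorted(lifetimes)
half = len(ordered) // 2

# Step 1 - the box. Assumption: quartiles are the medians of the lower and
# upper halves; this reproduces the summary statistics given above.
q1 = median(ordered[:half])
med = median(ordered)
q3 = median(ordered[-half:])
iqr = q3 - q1

# Step 2 - the fences, 1.5 x IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 3 - the whiskers end at the most extreme observations inside the fences.
inside = [y for y in ordered if lower_fence <= y <= upper_fence]
lower_whisker, upper_whisker = inside[0], inside[-1]

# Step 4 - anything outside the fences is a potential outlier.
outliers = [y for y in ordered if y < lower_fence or y > upper_fence]

print(f"Q1 = {q1}, median = {med:.1f}, Q3 = {q3}, IQR = {iqr:.1f}")
print(f"fences: {lower_fence:.2f} to {upper_fence:.2f}")
print(f"whiskers: {lower_whisker} to {upper_whisker}")
print(f"potential outliers: {outliers}")
```

The fences themselves are not drawn. Only the box, the whiskers, and the flagged points appear on the plot.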

Boxplot Questions

When are box-plots inappropriate?

When do we favor box-plots over histograms?

Explain why a point identified as an outlier by a box-plot may not be an outlier.

Linear Transformations!

Purpose

There are many situations which lead to the need to linearly transform data. E.g. Your firm sends some temperature readings to an American firm, so it transforms the readings in degrees Celsius to degrees Fahrenheit.

When we transform the data, what happens to summary statistics?

What are you measuring?

How a numerical measure is affected depends on what it measures: location or spread.

We create three classes of numerical summaries, which are affected differently by transformations:

Measures of Location: Mean, Median, Midrange, Quartiles, Percentiles, Min and Max

Measures of Spread: Standard Deviation, IQR, Range

Variance

Measures of Location

These are affected by both adding/subtracting and multiplying/dividing.

Let m be the current measure of location. Then:

Adding the constant a will result in the new measure: m’ = m + a

Multiplying by b will lead to: m’ = bm

Using the linear function f(x)=a+bx will lead to: m’ = a + bm

Measures of Spread

These are affected by multiplying/dividing, but not by adding/subtracting.

Let m be the current measure of spread. Then:

Adding the constant a will result in the new measure: m’ = m

Multiplying by b will lead to: m’ = |b|m (which is bm when b is positive)

Using the linear function f(x)=a+bx will lead to: m’ = |b|m

Variance

The variance responds to the same operations as the measures of spread, but the effect of multiplying is squared.

Let v be the variance. Then:

Adding the constant a will result in the new measure: v’ = v

Multiplying by b will lead to: v’ = b²v

Using the linear function f(x)=a+bx will lead to: v’ = b²v
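
The three sets of rules can be checked numerically. The sketch below uses the Celsius-to-Fahrenheit conversion f(x) = 32 + 1.8x on hypothetical temperature readings. The readings are made up for illustration, and the sample versions of the SD and variance from the `statistics` module are assumed.

```python
from statistics import mean, median, stdev, variance

# Hypothetical temperature readings in degrees Celsius (illustrative only).
celsius = [12.0, 15.5, 9.0, 21.0, 18.5]

a, b = 32.0, 1.8  # Fahrenheit = a + b * Celsius
fahrenheit = [a + b * x for x in celsius]

# Location: m' = a + b*m (each pair printed below should agree)
print(mean(fahrenheit), a + b * mean(celsius))
print(median(fahrenheit), a + b * median(celsius))

# Spread: m' = b*m, since b is positive here; adding a has no effect
print(stdev(fahrenheit), b * stdev(celsius))

# Variance: v' = b^2 * v
print(variance(fahrenheit), b ** 2 * variance(celsius))
```

Each printed pair should match up to floating-point rounding.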

Questions

The average weight of watermelons from a farm is 4.3 kg with a SD of 1.5 kg. What are the mean and SD of the weight in lbs?

The median and IQR on a final exam are 50 and 22. The instructor decides to multiply each grade by 1.13 and then add 5. What are the summary statistics now?
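
One way to work these questions is to apply the location and spread rules directly. The conversion factor of about 2.20462 lb per kg is a standard value, not one stated in the notes, and the variable names are illustrative.

```python
# Question 1: kilograms to pounds is multiplication only (a = 0).
KG_TO_LB = 2.20462  # approximate pounds per kilogram

mean_kg, sd_kg = 4.3, 1.5
mean_lb = KG_TO_LB * mean_kg   # location: m' = b*m
sd_lb = KG_TO_LB * sd_kg       # spread:   m' = b*m

# Question 2: new grade = 5 + 1.13 * old grade.
a, b = 5, 1.13
median_old, iqr_old = 50, 22
median_new = a + b * median_old  # location: m' = a + b*m
iqr_new = b * iqr_old            # spread:   m' = b*m, the added 5 drops out

print(f"mean = {mean_lb:.2f} lb, SD = {sd_lb:.2f} lb")
print(f"median = {median_new:.2f}, IQR = {iqr_new:.2f}")
```

Under these assumptions the watermelons average about 9.48 lb with an SD of about 3.31 lb, and the rescaled exam has a median of 61.5 and an IQR of 24.86.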
