Describing Data: One Variable

51
Statistics: Unlocking the Power of Data STAT 101 Dr. Kari Lock Morgan 9/6/12 Describing Data: One Variable SECTIONS 2.1, 2.2, 2.3, 2.4 One categorical variable (2.1) One quantitative variable (2.2, 2.3, 2.4)

description

STAT 101 Dr. Kari Lock Morgan 9/6/12. Describing Data: One Variable. SECTIONS 2.1, 2.2, 2.3, 2.4 One categorical variable (2.1) One quantitative variable (2.2, 2.3, 2.4). The Big Picture. Sample. Population. Sampling. Statistical Inference. Descriptive Statistics. - PowerPoint PPT Presentation

Transcript of Describing Data: One Variable

Page 1: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

STAT 101Dr. Kari Lock Morgan

9/6/12

Describing Data: One Variable

SECTIONS 2.1, 2.2, 2.3, 2.4• One categorical variable (2.1)• One quantitative variable (2.2, 2.3, 2.4)

Page 2: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

The Big Picture

Population

Sample

Sampling

Statistical Inference Descriptive

Statistics

Page 3: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Descriptive Statistics

In order to make sense of data, we need ways to summarize and visualize it

Summarizing and visualizing variables and relationships between two variables is often known as descriptive statistics (also known as exploratory data analysis)

Type of summary statistics and visualization methods depend on the type of variable(s) being analyzed (categorical or quantitative)

Page 4: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

One Categorical VariableA random sample of US adults in 2012 were surveyed regarding the type of cell phone owned

Android? iPhone? Blackberry? Non-smartphone? No cell phone?

Page 5: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Frequency Table

R: table(x)

•A frequency table shows the number of cases that fall in each category:

Android 458iPhone 437Blackberry 141Non Smartphone 924No cell phone 293Total 2253

Page 6: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Proportion

The proportion in a category is found by

Proportion for a sample: (“p-hat”)

Proportion for a population: p

Page 7: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

ProportionWhat proportion of adults sampled do not

own a cell phone?

Android 458iPhone 437Blackberry 141Non Smartphone 924No cell phone 293Total 2253

�̂�=2932253 =0.13

or 13%

Proportions and percentages can be used interchangeably

Page 8: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Relative Frequency TableA relative frequency table shows the proportion of cases that fall in each category

R: table(x)/length(x)

Android 0.203iPhone 0.194Blackberry 0.063Non Smartphone 0.410No cell phone 0.130

All the numbers in a relative frequency table sum to 1

Page 9: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Bar Chart/Plot/GraphIn a barplot, the height of the bar corresponds to the number of cases falling in each category

R: barchart(x)

Page 10: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Pie ChartIn a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category

R: pie(table(x))

Page 11: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

StatKeywww.lock5stat.com/statkey

Page 12: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Summary: One Categorical Variable

Summary Statistics Proportion Frequency table Relative frequency table

Visualization Bar chart Pie chart

Page 13: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

One Quantitative Variable

World gross for all 2011 Hollywood movies

HollywoodMovies2011

More graphics on profits for Hollywood movies

Page 14: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

HollywoodMovies2011

Page 15: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

DotplotIn a dotplot, each case is represented by a dot and dots are stacked. Easy way to see each case

Page 16: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

HistogramThe height of the each bar corresponds to the number of cases within that range of the variable

R: hist(x)

Page 17: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Histogram vs Bar Chart

This is a

a) Histogramb) Bar chartc) Otherd) I have no idea

Page 18: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Histogram vs Bar Chart

This is a

a) Histogramb) Bar chartc) Otherd) I have no idea

Page 19: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Histogram vs Bar ChartA bar chart is for categorical data, and the x-axis has no numeric scale

A histogram is for quantitative data, and the x-axis is numeric

For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed

For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different number of bars

Page 20: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Shape

Symmetric Left-SkewedRight-Skewed

Long right tail

Page 21: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

NotationThe sample size, the number of cases in the sample, is denoted by n

We often let x or y stand for any variable, and x1 , x2 , …, xn represent the n values of the variable x

x1 = 97.009, x2 = 201.897, x3 = 216.196, …

Page 22: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Mean

The mean or average of the data values is

Sample mean: Population mean: (“mu”)

𝑚𝑒𝑎𝑛=𝑥1+𝑥2+…+𝑥𝑛

𝑛 =∑ 𝑥𝑛

R: mean(x)

Page 23: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Median

The median, m, is the middle value when the data are ordered.

If there are an even number of values, the median is the average of the two middle values.

The median splits the data in half.

R: median(x)

Page 24: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

m = 76.66

=150.74Mean is “pulled” in the direction of skewness

Measures of Center

World Gross (in millions)

Page 25: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Skewness and Center

A distribution is left-skewed. Which measure of center would you expect to be higher?

a) Meanb) Median The mean will be

pulled down towards the skewness (towards the long tail).

Page 26: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Outlier

An outlier is an observed value that is notably distinct from the other

values in a dataset.

Page 27: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Outliers

World Gross (in millions)

Harry Potter

TransformersPirates of the Caribbean

Page 28: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Resistance

A statistic is resistant if it is relatively unaffected by extreme

values.

The median is resistant while the mean is not.

Mean MedianWith Harry Potter $150,742,300 $76,658,500

Without Harry Potter $141,889,900 $75,009,000

Page 29: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Outliers

When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake

If not, you have to decide whether the outlier is part of your population of interest or not

Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results

Page 30: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Standard Deviation

The standard deviation for a quantitative variable measures the

spread of the data

Sample standard deviation: sPopulation standard deviation: (“sigma”)

𝑠=√∑ (𝑥−𝑥 ) 2𝑛−1

R: sd(x)

Page 31: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Standard DeviationThe standard deviation gives a rough estimate

of the typical distance of a data values from the mean

The larger the standard deviation, the more variability there is in the data and the more spread out the data are

Page 32: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Frequency

-15 -10 -5 0 5 10 15

050

150

Frequency

-15 -10 -5 0 5 10 15

050

150

Standard Deviation

1s

4s

Both of these distributions are bell-shaped

Page 33: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

95% Rule

If a distribution of data is approximately symmetric and bell-shaped, about 95%

of the data should fall within two standard deviations of the mean.

For a population, 95% of the data will be between µ – 2 and µ + 2

StatKey

Page 34: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

The 95% RuleFrequency

-3 -2 -1 0 1 2 3

050

150

Frequency

-15 -10 -5 0 5 10 15

050

150

1s

4s

Page 35: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

The 95% RuleThe standard deviation for hours of sleep per night is closest to

a) ½b) 1c) 2d) 4e) I have no idea

2.03s

Page 36: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

z-score

The z-score for a data value, x, is

For a population, is replaced with µ and s is replaced with

Values farther from 0 are more extreme

Page 37: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

z-scoreA z-score puts values on a common scale

A z-score is the number of standard deviations a value falls from the mean

95% of all z-scores fall between what two values?

z-scores beyond -2 or 2 can be considered extreme

-2 and 2

Page 38: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

z-score

Which is better, an ACT score of 28 or a combined SAT score of 2100?

ACT: = 21, = 5SAT: = 1500, = 325

Assume ACT and SAT scores have approximately bell-shaped distributions

a) ACT score of 28b) SAT score of 2100c) I don’t know

28 21 7 1.45 5

z

2100 1500 600 1.85325 325

z

Page 39: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Other Measures of Location

Maximum = largest data value

Minimum = smallest data value

Quartiles:Q1 = median of the values below m.Q3 = median of the values above m.

Page 40: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Five Number SummaryFive Number Summary:

Min MaxQ1 Q3m

25% 25% 25% 25%

R: summary(x)

Page 41: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Five Number Summary

The distribution of number of hours spent studying each week is

a) Symmetricb) Right-skewedc) Left-skewedd) Impossible to tell

> summary(study_hours) Min. 1st Qu. Median 3rd Qu. Max. 2.00 10.00 15.00 20.00 69.00

Page 42: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Percentile

The Pth percentile is the value which is greater than P% of the data

We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better

We could also have used percentiles: ACT score of 28: 91st percentile SAT score of 2100: 97th percentile

Page 43: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Five Number SummaryFive Number Summary:

Min MaxQ1 Q3m

25% 25% 25% 25%

0th percentile

100th percentile

50th percentile

75th percentile

25th percentile

Page 44: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Measures of Spread

Range = Max – Min

Interquartile Range (IQR) = Q3 – Q1

Is the range resistant to outliers?a) Yesb) No

Is the IQR resistant to outliers?a) Yesb) No

The range depends entirely on the most extreme values.

The IQR is based off the middle 50% of the data, which will not contain outliers.

Page 45: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Comparing StatisticsMeasures of Center:

Mean (not resistant) Median (resistant)

Measures of Spread: Standard deviation (not resistant) IQR (resistant) Range (not resistant)

Most often, we use the mean and the standard deviation, because they are calculated based on all the data values, so use all the available information

Page 46: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

OutliersOutliers can be informally identified by

looking at a plot, but one rule of thumb for identifying outliers is data values more than 1.5 IQRs beyond the quartiles

A data value is an outlier if it is

Smaller than Q1 – 1.5(IQR)

or

Larger than Q3 + 1.5(IQR)

Page 47: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Boxplot

MedianQ1

Q3

middle 50% of data

• Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier

Outliers

R: boxplot(x)

Page 48: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Boxplot

Which boxplot goes with the histogram of waiting times for the bus?

Histogram of Bus

Bus

Frequency

0 5 10 15 20

010

20

(a) (b) (c)

The data do not show any low outliers.

Page 49: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

StatKeywww.lock5stat.com/statkey

Page 50: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

Summary: One Quantitative VariableSummary Statistics

Center: mean, median Spread: standard deviation, range, IQR Percentiles 5 number summary

Visualization Dotplot Histogram Boxplot

Other concepts Shape: symmetric, skewed, bell-shaped Outliers, resistance z-scores

Page 51: Describing Data: One Variable

Statistics: Unlocking the Power of Data Lock5

To DoRead Sections 2.1, 2.2, 2.3, 2.4

Do Homework 1 (due Tuesday, 9/11)

If you haven’t already…

Get the textbook (at bookstore now)Get a clicker and register it (due Tuesday, 9/11)