Module - Math & Statistics Department


Transcript of Module - Math & Statistics Department

Page 1: Module - Math & Statistics Department

Module 1

1

Stat 2300: Descriptive Statistics

Upon completion of this lesson you will be able to:
◦ Identify the population, parameter of interest, sample, and sample statistic in real-world situations
◦ Identify the data type in real-world situations
◦ Calculate numerical measures of center and variability for small data sets, using simple calculator functions and using the statistical functions of a calculator
◦ Calculate numerical measures of center and variability and create graphical representations for large data sets using RStudio

The use of specialized language in a domain (like statistics) can cause a subject to seem more difficult than it actually is.

For example…

Image source: http://nursingcrib.com/anatomy-and-physiology/anatomy-and-physiology-cells/

The Good
◦ No (or few) new words to learn to pronounce
◦ Can use prior knowledge to connect to new concepts

The Bad
◦ Old/common definitions may interfere with your understanding of the statistical term

When words that are part of everyday English are used differently in a domain, these words are said to have lexical ambiguity.

Word | My definition/use | Statistical definition/use

Page 2: Module - Math & Statistics Department


Get a piece of paper or open a file on your computer.

Take each of the words on the flashcard list below and either
◦ use it in a sentence, or
◦ write a short definition based on your most common use of the word.

Pay attention and keep track.

Notify the instructor of any words you think should be on lists like these but aren’t!

Page 3: Module - Math & Statistics Department


Population | Sample
All US citizens | All of your friends who live in the US
All students taking the SAT on a given day | The students who took the SAT exam at your local high school on that given day
All students at a certain university | The students at the university who are taking an online class

Parameter | Statistic
The median age of all US citizens (FYI: 37.2) | The median age of your friends who live in the US
The average SAT Math score of all test takers on a given day | The average SAT Math score of the test takers at your local high school on that given day
The actual proportion of female students at a certain university | The proportion of female students who are enrolled in at least one online class at a certain university

Page 4: Module - Math & Statistics Department


Image source: http://www.cultureeveryday.com/education-cultural-literacy/the-most-important-question-about-cultural-literacy/attachment/question-mark-in-the-sky

Nominal Data

Ordinal Data

Interval Data


Also called
◦ Categorical Data
◦ Qualitative Data

Data set: MLB
◦ Contains salary data for Major League Baseball players in the year 2010

Variables
◦ Player Name
◦ Team
◦ Position
◦ Salary

Nominal Data
◦ Bar Charts: MLB Positions


A sample of 230 applicants to a university’s business school was asked to report their undergraduate degree. Data was recorded using these codes:
◦ 1 = BA
◦ 2 = BBA
◦ 3 = B.Eng
◦ 4 = BSc
◦ 5 = Other
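The coding scheme above can be turned into a frequency table, which is what a bar chart of nominal data displays. The following is a minimal Python sketch for illustration only (the course itself uses a calculator and RStudio). The `responses` list is hypothetical and is not the real data from the 230 applicants; `frequency_table` is an illustrative helper name.

```python
from collections import Counter

# Code-to-label mapping taken from the slide.
DEGREE_CODES = {1: "BA", 2: "BBA", 3: "B.Eng", 4: "BSc", 5: "Other"}

# HYPOTHETICAL responses for illustration only (not the real 230 applicants).
responses = [1, 2, 2, 4, 5, 1, 3, 2, 4, 1]


def frequency_table(codes):
    """Count how often each degree label occurs.

    The numeric codes are only labels, so counting is the meaningful
    operation; averaging the codes would not make sense.
    """
    counts = Counter(DEGREE_CODES[c] for c in codes)
    # Report every category in code order, including any with zero responses.
    return {label: counts.get(label, 0) for label in DEGREE_CODES.values()}


table = frequency_table(responses)

# A crude text "bar chart": one '#' per response in each category.
for label, count in table.items():
    print(f"{label:6s} {'#' * count} ({count})")
```

Because the codes are nominal, the order of the bars carries no meaning; only the heights (counts) do.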

Page 5: Module - Math & Statistics Department


Nominal Data
◦ Bar Charts: Degree Types


Freshman = 1, Sophomore = 2, Junior = 3, Senior = 4

How is this different from the degree types example?

IDEA Evaluations
◦ Choices are
  1 = No apparent progress
  2 = Slight progress
  3 = Moderate progress
  4 = Substantial progress
  5 = Exceptional progress

Substantial progress (4) is more than Slight progress (2), but is it exactly twice as much?

Someone with a salary of 4000 makes twice as much as someone with a salary of 2000.

Page 6: Module - Math & Statistics Department


Interval Data
◦ Histograms

Uses numerical scale on horizontal axis

Bars touch!


Mean Median Mode

Symbol:
◦ μ if the data is from the entire population
◦ x̄ if the data is from a sample

\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}
\]

Here are the scores on the first exam in an introductory statistics course for 6 students:

Calculate the mean.

78 73 92 85 75 98
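As a check on the hand calculation, the mean formula can be written out directly. This is a small Python sketch for illustration (Python is not one of the course tools); the function name `mean` is just a label chosen here.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]


def mean(values):
    """Add up all the values and divide by how many there are."""
    return sum(values) / len(values)


x_bar = mean(scores)
print(x_bar)  # 501 / 6 = 83.5
```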

Symbol: Med, Md or Q2
◦ Arrange all the data in order from smallest to largest. The median is the number in the middle.
◦ If there is an even number of observations, the median is the average of the two middle observations.

Here are the scores on the first exam in an introductory statistics course for 6 students:

Calculate the median.

78 73 92 85 75 98
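The two-step rule for the median (sort, then take the middle value or the average of the two middle values) might be sketched in Python as follows. This is an illustration of the rule stated on the slide, not a course-provided routine.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]


def median(values):
    """Sort the data, then take the middle value.

    With an even number of observations, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


med = median(scores)
print(sorted(scores))  # [73, 75, 78, 85, 92, 98]
print(med)             # (78 + 85) / 2 = 81.5
```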

Page 7: Module - Math & Statistics Department


◦ If there is just one mode, the data is unimodal
◦ If there are two modes, the data is bimodal
◦ If there are more than two modes, the data is multimodal
◦ Not every data set has a mode

The mode is the only measure of central tendency allowed for nominal data.

Here are the scores on the first exam in an introductory statistics course for 6 students:

Calculate the mode.

78 73 92 85 75 98
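A short Python sketch of finding the mode is given below, under the assumption that a mode is a value occurring more often than any other and that a data set in which no value repeats has no mode (consistent with "Not every data set has a mode"). The helper name `modes` and the second, modified list are illustrative only.

```python
from collections import Counter

# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]


def modes(values):
    """Return the most frequent value(s); an empty list means no mode.

    ASSUMPTION: if every value occurs exactly once, the data set is
    treated as having no mode.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(scores))         # [] : every score occurs once, so no mode
# HYPOTHETICAL variation: add a second 85 to see a single mode appear.
print(modes(scores + [85]))  # [85]
```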

What if we add one data point to the list we had before? Say this person doesn’t do well.

Student scores: 78 73 92 85 75 98 22

Recalculate the mean. What happens?

Recalculate the median. What happens?
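To see how the extra score of 22 affects each measure, the comparison can be sketched with Python's standard `statistics` module (shown only as an illustration; the course uses a calculator and RStudio).

```python
import statistics

# Original six scores and the same list with the low score of 22 added.
scores = [78, 73, 92, 85, 75, 98]
scores_with_low = scores + [22]

mean_before = statistics.mean(scores)
mean_after = statistics.mean(scores_with_low)
median_before = statistics.median(scores)
median_after = statistics.median(scores_with_low)

# The mean uses every value, so one extreme score pulls it down a lot.
print(mean_before, round(mean_after, 2))  # 83.5 74.71
# The median depends only on the middle of the ordered list, so it moves less.
print(median_before, median_after)        # 81.5 78
```

The mean falls by almost 9 points while the median falls by 3.5, which is the sense in which the mean is sensitive to extreme values.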

Mean is usually the first choice for interval and ordinal data
◦ Calculation includes all data points
◦ Mean is sensitive to extreme values

Median is preferred when there are extreme values in the data set
◦ Income data
◦ Stat 2300 exam scores!

Mode
◦ Not generally used for interval data
◦ The only choice for nominal data


• http://www.math.usu.edu/~schneit/CTIS/MM/

Range Variance Standard Deviation

Page 8: Module - Math & Statistics Department


Knowing the center of the data is not enough.

Both graphs below have the same measures of center. Which would you rather use?

Range

Advantages
◦ Quick

Disadvantages
◦ Only uses 2 observations from the whole set
◦ Very sensitive to extreme values

78 73 92 85 75 98
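The range is the largest value minus the smallest. A minimal Python sketch for the six exam scores (the name `data_range` is chosen here to avoid shadowing Python's built-in `range`):

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]

# Only two observations, the maximum and the minimum, enter the calculation.
data_range = max(scores) - min(scores)
print(data_range)  # 98 - 73 = 25
```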

Deviation from average
◦ Take each number on a list and subtract the average. This is its deviation.
◦ Recall: average = 83.5

What is the typical size of the deviations?
◦ What happens when you take the average of the deviations?

Number: 78 73 92 85 75 98
Deviation:

Square the deviations to make them all positive.

Compute the average of the squared deviations. This is the variance.

Number | Deviation | Squared Deviation
78 | -5.5 |
73 | -10.5 |
92 | 8.5 |
85 | 1.5 |
75 | -8.5 |
98 | 14.5 |

We calculate the variance differently for data from a population and data from a sample.

Why?
◦ We typically want to use a sample variance to estimate a population variance
◦ Populations usually have larger variance than a sample
◦ When we have data from a sample, we have to inflate our estimate to more closely match the variance of the population

Divide by n-1 instead of n

Number | Deviation | Squared Deviation
78 | -5.5 | 30.25
73 | -10.5 | 110.25
92 | 8.5 | 72.25
85 | 1.5 | 2.25
75 | -8.5 | 72.25
98 | 14.5 | 210.25
Sum | 0 | 497.5
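The table can be reproduced step by step: subtract the mean, square each deviation, sum, and then divide by n (treating the scores as a population) or by n-1 (treating them as a sample). The Python below is a sketch of that arithmetic for illustration; the variable names are chosen here.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]
n = len(scores)
x_bar = sum(scores) / n  # 83.5

# Deviation of each score from the mean; these always sum to zero.
deviations = [x - x_bar for x in scores]
# Squaring makes every term non-negative.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

population_variance = sum_of_squares / n    # divide by n
sample_variance = sum_of_squares / (n - 1)  # divide by n - 1

for x, d, sq in zip(scores, deviations, squared_deviations):
    print(x, d, sq)
print("Sum", sum(deviations), sum_of_squares)  # Sum 0.0 497.5
print(round(population_variance, 2))           # 82.92
print(sample_variance)                         # 99.5
```

Dividing by n-1 gives the larger of the two values, which is the "inflation" described above.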

Page 9: Module - Math & Statistics Department


Population Variance

\[
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
\]

Sample Variance

\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
\]

Variance is calculated in square units
◦ e.g. s² = 99.5 points squared
◦ Hard or impossible to imagine

Use standard deviation instead

Population standard deviation:

\[
\sigma = \sqrt{\sigma^2}
\]

Sample standard deviation:

\[
s = \sqrt{s^2}
\]

Number | Deviation | Squared Deviation
78 | -5.5 | 30.25
73 | -10.5 | 110.25
92 | 8.5 | 72.25
85 | 1.5 | 2.25
75 | -8.5 | 72.25
98 | 14.5 | 210.25
Sum | 0 | 497.5
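Taking the square root of the variance brings the units back to points. As an illustration, Python's standard `statistics` module provides `pstdev` (population, divides by N) and `stdev` (sample, divides by n-1); the sketch below applies both to the six exam scores.

```python
import statistics

# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]

# Population standard deviation: square root of (497.5 / 6).
sigma = statistics.pstdev(scores)
# Sample standard deviation: square root of (497.5 / 5) = sqrt(99.5).
s = statistics.stdev(scores)

print(round(sigma, 2))  # about 9.11 points
print(round(s, 2))      # about 9.97 points
```

Which function is appropriate depends on whether the six scores are treated as the whole population or as a sample from a larger class.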

The more tools you have in your tool belt, the more flexibly you can work to solve problems.

If you (only) have a hammer…then all your problems look like nails. –Neil deGrasse Tyson

Good for small data sets and for when you already have the summary statistics.

Portable, easy to use.

Secure for testing (no internet access).

Recommended: TI-83, TI-84, or TI-89.
◦ The TI-85 and TI-86 do not have the functions you need for this class
◦ Other calculators may, but I do not provide instructions for them.

Many choices
◦ SAS
◦ SPSS
◦ Minitab
◦ Excel

Not going to use these…

Page 10: Module - Math & Statistics Department


Open source = free

Powerful, widely used in industry

Platform independent: looks the same on Macs & PCs

Exposure to programming language
◦ It’s good for you (and your future job prospects!)

Knowledge transfer
◦ All stat software is similar
◦ Learning one program makes learning another easier

I promise to hold your hand

R is a programming language!

R is case sensitive!

Spelling and punctuation matter!

Crawl: Learn the formulas, the ideas.
◦ Do some simple calculations “by hand”

Walk: Automate the formulas using a calculator
◦ When summary statistics are available
◦ With small data sets (n<20 or so)

Run: Use statistical software
◦ Visualize large data sets
◦ Do more complicated calculations
◦ Investigate alternative hypotheses

The single-season home run record was broken by Barry Bonds of the San Francisco Giants in 2001, when he hit 73 home runs. Here are Bonds’ home run totals from 1986 to 2003:

Calculate measures of center and variability for this data.

16 25 24 19 33 25 34 46 37

33 42 40 37 34 49 73 46 45
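One way to check the calculator or RStudio results for this exercise is a short script. The Python sketch below (illustrative only) treats the 18 seasons as a sample, so the standard deviation divides by n-1; that choice is an assumption, since the slide does not say whether to treat the seasons as a sample or a population.

```python
import statistics

# Bonds' home run totals, 1986 to 2003, from the slide.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37,
             33, 42, 40, 37, 34, 49, 73, 46, 45]

# Measures of center.
hr_mean = statistics.mean(home_runs)
hr_median = statistics.median(home_runs)

# Measures of variability.
hr_range = max(home_runs) - min(home_runs)
# ASSUMPTION: the seasons are treated as a sample (n - 1 divisor).
hr_stdev = statistics.stdev(home_runs)

print(len(home_runs))     # 18 seasons
print(round(hr_mean, 2))  # 658 / 18, about 36.56
print(hr_median)          # (34 + 37) / 2 = 35.5
print(hr_range)           # 73 - 16 = 57
print(hr_stdev)
```

The record season of 73 pulls the mean slightly above the median, the same effect seen with the exam scores.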

Boxplots and the 5 number summary
◦ Minimum
◦ Q1 – first quartile
◦ Median
◦ Q3 – third quartile
◦ Maximum

[Figure: boxplot labeled min, Q1, median, Q3, max]

Boxplots can be horizontal or vertical

Both show how the data is distributed

Histograms give a more detailed view

Page 11: Module - Math & Statistics Department


The interquartile range is the difference between Q3 and Q1
◦ Range = max - min
◦ IQR = Q3 - Q1

Outliers:
◦ Data points more than Q3 + 1.5(IQR)
◦ Data points less than Q1 - 1.5(IQR)

When outliers are present, the “whiskers” will be at these cutoffs instead of min and max.
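The 1.5(IQR) rule can be tried on the Bonds home run data from earlier in the module. The Python sketch below is an illustration under one stated assumption: quartiles are computed with `statistics.quantiles(..., method="inclusive")`. Quartile conventions differ between tools, so a calculator or RStudio may report slightly different Q1 and Q3 values.

```python
import statistics

# Bonds' home run totals, 1986 to 2003, from the slide.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37,
             33, 42, 40, 37, 34, 49, 73, 46, 45]

# ASSUMPTION: "inclusive" quartile convention; other conventions exist.
q1, q2, q3 = statistics.quantiles(home_runs, n=4, method="inclusive")
iqr = q3 - q1

# Cutoffs from the slide: Q1 - 1.5(IQR) and Q3 + 1.5(IQR).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in home_runs if x < lower_fence or x > upper_fence]
# The upper whisker stops at the last data point not beyond the cutoff.
upper_whisker = max(x for x in home_runs if x <= upper_fence)

print(min(home_runs), q1, q2, q3, max(home_runs))  # 16 27.0 35.5 44.25 73
print(iqr, lower_fence, upper_fence)               # 17.25 1.125 70.125
print(outliers, upper_whisker)                     # [73] 49
```

With this convention the 73-home-run season is flagged as an outlier. Under a median-of-halves convention (Q1 = 25, Q3 = 45) the upper cutoff is 75 and 73 is not flagged, so the quartile method is worth checking when results disagree.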

[Figure: boxplot labeled min, Q1, median, Q3, max; the upper whisker ends at Q3+1.5(IQR) or the last data point below Q3+1.5(IQR)]

What it looks like when a data set has lots of outliers:

A common use of boxplots is to look for differences in the distribution of an interval variable based on a nominal variable (factor)