Exam 2 Review

Statistics: Unlocking the Power of Data Exam 2 Review STAT 101 Dr. Kari Lock Morgan


Transcript of Exam 2 Review

Page 1: Exam 2 Review

Statistics: Unlocking the Power of Data Lock5

Exam 2 Review

STAT 101

Dr. Kari Lock Morgan

Page 2: Exam 2 Review

Exam Details

Wednesday, 4/2

• Closed to everything except two double-sided pages of notes and a non-cell-phone calculator
• Pages of notes should be prepared by you – no sharing
• Okay to use materials from class for your pages of notes

Best ways to prepare:
• #1: WORK LOTS OF PROBLEMS!
• Make a good page of notes
• Read sections you are still confused about
• Come to office hours and clarify confusion

Cumulative, but emphasis is on material since Exam 1 (Chapters 5-9, we skipped 8.2 and 9.2)

Page 3: Exam 2 Review

Practice Problems

• Practice exam online (under resources)

• Solutions to odd essential synthesis and review problems online (under resources)

• Solutions to all odd problems in the book on reserve at Perkins

Page 4: Exam 2 Review

Office Hours and Help

Monday 3–4pm: Prof Morgan, Old Chem 216

Monday 4–6pm: Stephanie Sun, Old Chem 211A

Tuesday 3–5pm (extra): Prof Morgan, Old Chem 216

Tuesday 5-7pm: Wenjing Shi, Old Chem 211A

Tuesday 7-9pm: Mao Hu, Old Chem 211A

REVIEW SESSION: 5–6 pm Tuesday, Social Sciences 126

Page 5: Exam 2 Review

Stat Education Center

Reminder: the Stat Education Center in Old Chem 211A is open Sunday – Thursday, 4pm – 9pm, with stat majors and stat PhD students available to answer questions

Page 6: Exam 2 Review

Two Options for p-values

We have learned two ways of calculating p-values. The only difference is how to create a distribution of the statistic, assuming the null is true:

1) Simulation (Randomization Test):
• Directly simulate what would happen, just by random chance, if the null were true

2) Formulas and Theoretical Distributions:
• Use a formula to create a test statistic for which we know the theoretical distribution when the null is true, if sample sizes are large enough
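As a rough illustration of the two options, the sketch below computes an upper-tail p-value for a single proportion both ways. The data (60 successes in 100 trials), the null value 0.5, and the function names are made up for this example and do not come from the slides.

```python
import random
from statistics import NormalDist

def randomization_p_value(n, observed_count, p0, reps=10_000, seed=0):
    """Option 1: simulate samples assuming the null proportion p0 is true."""
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(reps):
        # One simulated sample of size n under the null hypothesis
        count = sum(rng.random() < p0 for _ in range(n))
        if count >= observed_count:
            as_extreme += 1
    return as_extreme / reps

def normal_p_value(n, observed_count, p0):
    """Option 2: z = (sample statistic - null value) / SE, then a normal tail area."""
    p_hat = observed_count / n
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / se
    return 1 - NormalDist().cdf(z)

# Hypothetical data: 60 successes out of 100, testing H0: p = 0.5 vs Ha: p > 0.5
sim_p = randomization_p_value(n=100, observed_count=60, p0=0.5)
formula_p = normal_p_value(n=100, observed_count=60, p0=0.5)
print(f"simulation: {sim_p:.4f}, formula: {formula_p:.4f}")
```

With a sample this size the two p-values should be broadly similar, though not identical, since the simulated counts are discrete and the normal curve is only an approximation.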

Page 7: Exam 2 Review

Two Options for Intervals

We have learned two ways of calculating intervals:

1) Simulation (Bootstrap):
• Assess the variability in the statistic by creating many bootstrap statistics

2) Formulas and Theoretical Distributions:
• Use a formula to calculate the standard error of the statistic, and use the normal or t-distribution to find z* or t*, if sample sizes are large enough
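A minimal sketch of the bootstrap option for a sample mean follows, with the formula-based standard error computed alongside for comparison. The 20 data values, the number of bootstrap samples, and all names are assumptions chosen for illustration only.

```python
import random
from statistics import mean, stdev

# Hypothetical sample of 20 quantitative measurements (made up for illustration)
sample = [23, 19, 31, 27, 22, 25, 30, 18, 26, 24,
          29, 21, 28, 20, 27, 25, 23, 32, 22, 26]

def bootstrap_statistics(data, reps=5000, seed=1):
    """Return `reps` bootstrap means, each from a resample with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [mean(rng.choices(data, k=n)) for _ in range(reps)]

boot_means = sorted(bootstrap_statistics(sample))
boot_se = stdev(boot_means)  # variability of the bootstrap statistics

# Percentile interval: the middle 95% of the bootstrap distribution
lower_pct = boot_means[int(0.025 * len(boot_means))]
upper_pct = boot_means[int(0.975 * len(boot_means)) - 1]

# Standard-error interval: statistic +/- 2 * SE (a rough 95% interval)
lower_se = mean(sample) - 2 * boot_se
upper_se = mean(sample) + 2 * boot_se

# Formula-based standard error of the mean, s / sqrt(n), for comparison
formula_se = stdev(sample) / len(sample) ** 0.5

print(f"percentile interval: ({lower_pct:.2f}, {upper_pct:.2f})")
print(f"SE interval:         ({lower_se:.2f}, {upper_se:.2f})")
print(f"bootstrap SE {boot_se:.3f} vs formula SE {formula_se:.3f}")
```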

Page 8: Exam 2 Review

Pros and Cons

1) Simulation Methods

PROS:
• Methods tied directly to concepts, emphasizing conceptual understanding
• Same procedure for every statistic
• No formulas or theoretical distributions to learn and distinguish between
• Minimal math needed

CONS:
• Need entire dataset (if quantitative variables)
• Need a computer
• Newer approach

Page 9: Exam 2 Review

Pros and Cons

2) Formulas and Theoretical Distributions

PROS:
• Only need summary statistics
• Only need a calculator
• More commonly used

CONS:
• Plugging numbers into formulas does little for conceptual understanding
• Many different formulas and distributions to learn and distinguish between
• Harder to see the big picture when the details are different for each statistic
• Doesn’t work for small sample sizes
• Requires more math and background knowledge

Page 10: Exam 2 Review

Accuracy

• The accuracy of simulation methods depends on the number of simulations (more simulations = more accurate)

• The accuracy of formulas and theoretical distributions depends on the sample size (larger sample size = more accurate)

• If the sample size is large and you have generated many simulations, the two methods should give essentially the same answer

Page 11: Exam 2 Review

Data Collection

Was the sample randomly selected?
• Yes: Possible to generalize to the population
• No: Should not generalize to the population

Was the explanatory variable randomly assigned?
• Yes: Possible to make conclusions about causality
• No: Cannot make conclusions about causality

Page 12: Exam 2 Review

Variable(s) | Visualization | Summary Statistics
Categorical | bar chart, pie chart | frequency table, relative frequency table, proportion
Quantitative | dotplot, histogram, boxplot | mean, median, max, min, standard deviation, z-score, range, IQR, five number summary
Categorical vs Categorical | side-by-side bar chart, segmented bar chart | two-way table, difference in proportions
Quantitative vs Categorical | side-by-side boxplots | statistics by group, difference in means
Quantitative vs Quantitative | scatterplot | correlation, simple linear regression

Page 13: Exam 2 Review

Confidence Interval

• A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples

• A 95% confidence interval will contain the true parameter for 95% of all samples
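The "95% of all samples" idea can be checked by simulation. The sketch below draws many samples from a population with a known proportion and counts how often the interval captures it. The true proportion 0.4, the sample size 200, and the use of the normal-based interval for a proportion are assumptions for this example.

```python
import random

def coverage_rate(true_p=0.4, n=200, num_samples=2000, z_star=1.96, seed=3):
    """Fraction of simulated samples whose 95% interval contains true_p."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(num_samples):
        # Draw one sample of size n from a population with proportion true_p
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        se = (p_hat * (1 - p_hat) / n) ** 0.5
        # Interval: sample statistic +/- z* * SE
        if p_hat - z_star * se <= true_p <= p_hat + z_star * se:
            captured += 1
    return captured / num_samples

rate = coverage_rate()
print(f"Proportion of intervals capturing the parameter: {rate:.3f}")
```

The printed rate should land near 0.95, with some wobble from the finite number of simulated samples.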

Page 14: Exam 2 Review

Hypothesis Testing

• How unusual would it be to get results as extreme as (or more extreme than) those observed, if the null hypothesis is true?

• If it would be very unusual, then the null hypothesis is probably not true!

• If it would not be very unusual, then there is not evidence against the null hypothesis

Page 15: Exam 2 Review

p-value

• The p-value is the probability of getting a statistic as extreme as (or more extreme than) that observed, just by random chance, if the null hypothesis is true

• The p-value measures evidence against the null hypothesis

Page 16: Exam 2 Review

Hypothesis Testing

1. State hypotheses

2. Calculate a test statistic, based on your sample data

3. Create a distribution of this test statistic, as it would be observed if the null hypothesis were true

4. Use this distribution to measure how extreme your test statistic is
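The four steps can be sketched as a randomization test for a difference in means. The two groups below are made-up data, and reshuffling group labels is one common way (assumed here) to create the distribution under the null.

```python
import random
from statistics import mean

# Step 1: H0: mu_a = mu_b  vs  Ha: mu_a != mu_b
# Hypothetical data for two groups (made up for illustration)
group_a = [12, 15, 14, 16, 18, 17, 13, 19]
group_b = [10, 11, 9, 13, 12, 8, 14, 10]

def randomization_test(a, b, reps=5000, seed=2):
    rng = random.Random(seed)
    # Step 2: test statistic from the sample data
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    as_extreme = 0
    for _ in range(reps):
        # Step 3: if the null is true, group labels are interchangeable,
        # so reshuffle the pooled values and recompute the statistic
        rng.shuffle(pooled)
        stat = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Step 4: count simulated statistics at least as extreme (two-tailed)
        if abs(stat) >= abs(observed):
            as_extreme += 1
    return observed, as_extreme / reps

observed_diff, p_value = randomization_test(group_a, group_b)
print(f"observed difference = {observed_diff}, p-value = {p_value:.4f}")
```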

Page 17: Exam 2 Review

Distribution of the Sample Statistic

1. Sampling distribution: distribution of the statistic based on many samples from the population

2. Bootstrap distribution: distribution of the statistic based on many samples with replacement from the original sample

3. Randomization distribution: distribution of the statistic assuming the null hypothesis is true

4. Normal, t, χ², F: theoretical distributions used to approximate the distribution of the statistic

Page 18: Exam 2 Review

Sample Size Conditions

• For large sample sizes, either simulation methods or theoretical methods work

• If sample sizes are too small, only simulation methods can be used

Page 19: Exam 2 Review

Using Distributions

• For confidence intervals, you find the desired percentage in the middle of the distribution, then find the corresponding value on the x-axis

• For p-values, you find the value of the observed statistic on the x-axis, then find the area in the tail(s) of the distribution
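For a standard normal distribution, both directions can be sketched with Python's `statistics.NormalDist`: going from a middle percentage to a value on the x-axis (z*), and from an observed value to a tail area (p-value). The helper names and the example inputs are assumptions for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def z_star(confidence_level):
    """Value on the x-axis that leaves `confidence_level` in the middle."""
    tail = (1 - confidence_level) / 2
    return standard_normal.inv_cdf(1 - tail)

def two_tailed_p_value(z):
    """Area in both tails beyond the observed standardized statistic."""
    return 2 * (1 - standard_normal.cdf(abs(z)))

print(f"z* for 95% confidence: {z_star(0.95):.3f}")
print(f"two-tailed p-value for z = 2: {two_tailed_p_value(2.0):.4f}")
```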

Page 20: Exam 2 Review

Confidence Intervals

[Figure: "Best Guess at Sampling Distribution" plotted against the statistic, with the observed statistic marked and the middle P% of the distribution lying between the lower bound and the upper bound]

Page 21: Exam 2 Review

Confidence Intervals

[Figure: N(0,1) distribution with the middle P% shaded and z* marked on the x-axis]

Return to original scale with

$$\text{sample statistic} \pm z^* \cdot SE$$

Page 22: Exam 2 Review

Hypothesis Testing

[Figure: "Distribution of Statistic Assuming Null" plotted against the statistic, with the observed statistic marked and the p-value shown as the area in the tail beyond it]

Page 23: Exam 2 Review

General Formulas

• When performing inference for a single parameter (or difference in two parameters), the following formulas are used:

$$z = \frac{\text{sample statistic} - \text{null value}}{SE}$$

$$\text{sample statistic} \pm z^* \cdot SE$$

Page 24: Exam 2 Review

General Formulas

• For proportions (categorical variables) with only two categories, the normal distribution is used

• For inference involving any quantitative variable (means, correlation, slope), if categorical variables only have two categories, the t distribution is used

Page 25: Exam 2 Review

Standard Error

• The standard error is the standard deviation of the sample statistic

• The formula for the standard error depends on the type of statistic (which depends on the type of variable(s) being analyzed)

Page 26: Exam 2 Review

Standard Error Formulas

Parameter | Distribution | Standard Error
Proportion | Normal | $\sqrt{\frac{p(1-p)}{n}}$
Difference in Proportions | Normal | $\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
Mean | t, df = n – 1 | $\sqrt{\frac{\sigma^2}{n}}$
Difference in Means | t, df = min(n1, n2) – 1 | $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
Correlation | t, df = n – 2 | $\sqrt{\frac{1-r^2}{n-2}}$
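The standard error formulas for proportions and means translate directly into small helper functions. This is only a sketch: the function names and the example numbers are made up, and in practice sample values are plugged in for the unknown p and σ.

```python
from math import sqrt

def se_proportion(p, n):
    """SE for a single proportion: sqrt(p(1 - p) / n)."""
    return sqrt(p * (1 - p) / n)

def se_diff_proportions(p1, n1, p2, n2):
    """SE for a difference in proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_mean(sd, n):
    """SE for a single mean: sqrt(sd^2 / n); use with t, df = n - 1."""
    return sqrt(sd ** 2 / n)

def se_diff_means(sd1, n1, sd2, n2):
    """SE for a difference in means; use with t, df = min(n1, n2) - 1."""
    return sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical summary statistics, chosen so the arithmetic is easy to check
print(se_proportion(0.5, 100))                 # sqrt(0.25 / 100)
print(se_diff_proportions(0.5, 100, 0.5, 100)) # sqrt(0.0025 + 0.0025)
print(se_mean(10, 25))                         # sqrt(100 / 25)
print(se_diff_means(3, 9, 4, 16))              # sqrt(1 + 1)
```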

Page 27: Exam 2 Review

Multiple Categories

• These formulas do not work for categorical variables with more than two categories, because there are multiple parameters

• For one or two categorical variables with multiple categories, use χ² tests (goodness of fit for one categorical variable, test for association for two)

• For testing for a difference in means across multiple groups, use ANOVA

Page 28: Exam 2 Review

Chi-Square Test for Goodness of Fit

1. State null hypothesized proportions for each category, $p_i$. The alternative is that at least one of the proportions is different than specified in the null.

2. Calculate the expected counts for each cell as $n p_i$. Make sure they are all greater than 5 to proceed.

3. Calculate the χ² statistic:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

4. Compute the p-value as the area in the tail above the χ² statistic, for a χ² distribution with df = (# of categories – 1)

5. Interpret the p-value in context.
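The goodness-of-fit steps can be sketched in Python as follows. The observed counts and null proportions are made up. The tail-area helper relies on the fact that, for df = 2 only, the area above x in a χ² distribution equals exp(-x/2); other degrees of freedom would need a table or statistical software.

```python
from math import exp

def chi_square_gof(observed, null_props):
    """Return (chi-square statistic, df, expected counts) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in null_props]   # step 2: expected count = n * p_i
    if min(expected) <= 5:
        raise ValueError("expected counts should all be greater than 5")
    # Step 3: sum of (observed - expected)^2 / expected over all categories
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df, expected

def chi_square_tail_df2(x):
    """Upper-tail area of a chi-square distribution; valid for df = 2 only."""
    return exp(-x / 2)

# Hypothetical data: three categories, null proportions 0.25, 0.50, 0.25
observed = [38, 55, 27]
statistic, df, expected = chi_square_gof(observed, [0.25, 0.50, 0.25])
p_value = chi_square_tail_df2(statistic)   # step 4, usable here because df == 2
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.4f}")
```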

Page 29: Exam 2 Review

Chi-Square Test for Association

1. H0: The two variables are not associated
   Ha: The two variables are associated

2. Calculate the expected counts for each cell:

$$\text{expected} = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$$

Make sure they are all greater than 5 to proceed.

3. Calculate the χ² statistic:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

4. Compute the p-value as the area in the tail above the χ² statistic, for a χ² distribution with df = (r – 1) (c – 1)

5. Interpret the p-value in context.
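A sketch of the test for association on a two-way table follows. The 2×2 counts are made up. The p-value helper uses the fact that a χ² variable with df = 1 is the square of a standard normal, so it applies to 2×2 tables only; larger tables would need a χ² table or software.

```python
from statistics import NormalDist

def chi_square_association(table):
    """Return (chi-square statistic, df, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Step 2: expected = row total * column total / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Step 3: sum of (observed - expected)^2 / expected over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

def chi_square_tail_df1(x):
    """Upper-tail area of a chi-square distribution; valid for df = 1 only."""
    return 2 * (1 - NormalDist().cdf(x ** 0.5))

# Hypothetical 2x2 table of counts (rows: groups, columns: outcomes)
table = [[30, 20],
         [20, 30]]
statistic, df, expected = chi_square_association(table)
p_value = chi_square_tail_df1(statistic)   # usable here because df == 1
print(f"chi-square = {statistic}, df = {df}, p-value = {p_value:.4f}")
```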

Page 30: Exam 2 Review

Analysis of Variance

• Analysis of Variance (ANOVA) compares the variability between groups to the variability within groups

[Diagram: Total Variability splits into Variability Between Groups and Variability Within Groups]

Page 31: Exam 2 Review

ANOVA Table

Source | df | Sum of Squares | Mean Square | F Statistic | p-value
Groups | k-1 | SSG | MSG = SSG/(k-1) | MSG/MSE | Use F(k-1, n-k)
Error | n-k | SSE | MSE = SSE/(n-k) | |
Total | n-1 | SST | | |
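The entries of the ANOVA table can be computed from raw data as in the sketch below. The three small groups are made up, and the p-value is left out because it requires the F distribution with k-1 and n-k degrees of freedom from a table or software.

```python
from statistics import mean

def anova_table(groups):
    """Compute the entries of a one-way ANOVA table from a list of groups."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = mean(all_values)
    # Variability between groups: group means around the grand mean
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability within groups: values around their own group mean
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Total variability: values around the grand mean (equals SSG + SSE)
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    msg = ssg / (k - 1)
    mse = sse / (n - k)
    return {
        "df_groups": k - 1, "df_error": n - k, "df_total": n - 1,
        "SSG": ssg, "SSE": sse, "SST": sst,
        "MSG": msg, "MSE": mse, "F": msg / mse,
    }

# Hypothetical data: three groups of three observations each
result = anova_table([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```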

Page 32: Exam 2 Review

Simple Linear Regression

• Simple linear regression estimates the population model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

• with the sample model:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
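A minimal sketch of fitting the sample model follows, assuming the usual least-squares estimates (the slides do not spell out the estimation method). The five data points and the function name are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (intercept, slope) for y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying close to the line y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(x, y)
predicted_at_6 = b0 + b1 * 6   # point on the fitted line at x = 6
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, prediction at x = 6: {predicted_at_6:.3f}")
```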

Page 33: Exam 2 Review

Inference for the Slope

• Confidence intervals and hypothesis tests for the slope can be done using the familiar formulas:

$$t = \frac{\text{sample statistic} - \text{null value}}{SE}$$

$$\text{sample statistic} \pm t^* \cdot SE$$

• Population Parameter: $\beta_1$, Sample Statistic: $\hat{\beta}_1$

• Use t-distribution with n – 2 degrees of freedom
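The sketch below applies the two formulas to a fitted slope. It assumes least-squares estimates and the standard textbook standard error of the slope, sqrt(SSE/(n-2)) / sqrt(Sxx), which is not shown on the slide. The data are made up, and t* = 3.182 is the t-table value for 95% confidence with 3 degrees of freedom.

```python
from math import sqrt
from statistics import mean

def slope_inference(x, y, t_star, null_value=0.0):
    """t statistic and confidence interval for the slope of a simple linear model."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Sum of squared residuals around the fitted line
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Assumed standard error of the slope (standard textbook formula)
    se = sqrt(sse / (n - 2)) / sqrt(sxx)
    t = (slope - null_value) / se                       # (statistic - null) / SE
    interval = (slope - t_star * se, slope + t_star * se)  # statistic +/- t* * SE
    return {"slope": slope, "se": se, "t": t, "df": n - 2, "interval": interval}

# Hypothetical data; t* looked up in a t table for 95% confidence, df = 3
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = slope_inference(x, y, t_star=3.182)
print(result)
```

The t statistic would then be compared to a t-distribution with the reported df to get a p-value, using a table or software.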

Page 34: Exam 2 Review

Intervals

• A confidence interval has a given chance of capturing the mean y value at a specified x value (the point on the line)

• A prediction interval has a given chance of capturing the y value for a particular case at a specified x value (the actual point)

Page 35: Exam 2 Review

Conditions for SLR

Inference based on the simple linear model is only valid if the following conditions hold:

1) Linearity
2) Constant Variability of Residuals
3) Normality of Residuals

Page 36: Exam 2 Review

Inference Methods

http://prezi.com/c1xz1on-p4eb/stat-101/