Statistical Hypothesis Testing - Stanford University · Enter statistics. Hypothesis testing...

Statistical Hypothesis Testing

a dramatically incomplete primer

Are you just lucky?

You live in one world in which the results came out the way they did.

If we tried it in one hundred parallel worlds, in how many would it have come out the same way? 1? 80? 100?

[Figure: the space of all possible samples, with one sample marked "You are here"]

Enter statistics

Hypothesis testing formalizes our intuition on this question. It quantifies: if there were no real effect, in what % of parallel worlds would the results have come out at least this strongly?

This is what we call a p-value. p < .05 intuitively means “if there were no real effect, a result like this would have come up in fewer than 5% of parallel worlds” (parallel world = sample).
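One way to make the parallel-worlds picture concrete is a permutation test: if the condition labels made no difference, shuffling them should produce a gap as large as the observed one fairly often. The sketch below is only an illustration. The recall scores are made up, and `permutation_p_value` is a hypothetical helper, not something from the slides.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_worlds=10_000, seed=0):
    """Estimate how often a label-shuffled 'parallel world' shows a
    mean difference at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_worlds):
        rng.shuffle(pooled)  # one parallel world with no real effect
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_worlds

# Hypothetical recall scores (illustrative only, not real data).
full_sleep = [14, 15, 13, 16, 15, 17, 14, 16]
short_sleep = [12, 13, 11, 14, 12, 13, 12, 11]

p = permutation_p_value(full_sleep, short_sleep)
print(f"estimated p-value: {p:.4f}")
```

A small p here says that very few no-effect worlds produce a gap this large, which is the sense in which the result is "unlikely to be luck".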

Enter statistics

P-values help us to make claims about populations:

“Students have better recall after a full night’s sleep!”

...when we only tested a small sample:

“...because these students had better recall after a full night’s sleep!”

Science depends upon this capacity for statistical inference

Why does this work?

[Figure: the population distribution with the population mean marked, alongside the distribution of the sample mean at n=1 and at n=100]

Why does this work?

[Figure: distribution of sample means at n=100 around the population mean]

Central Limit Theorem: as sample size grows, the distribution of the means of samples approximates a normal distribution centered about the population mean (when samples are independent and identically distributed)
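The Central Limit Theorem can be checked with a quick simulation. This is a minimal sketch under assumed conditions: the population is a skewed exponential distribution with mean 1.0, chosen here purely for illustration, and `sample_means` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

def draw_from_population():
    # A deliberately skewed (exponential) population with mean 1.0.
    return rng.expovariate(1.0)

def sample_means(n, n_samples=2000):
    """Draw n_samples independent samples of size n; return their means."""
    return [mean(draw_from_population() for _ in range(n))
            for _ in range(n_samples)]

means_n1 = sample_means(1)
means_n100 = sample_means(100)

print(f"n=1:   center {mean(means_n1):.2f}, spread {stdev(means_n1):.2f}")
print(f"n=100: center {mean(means_n100):.2f}, spread {stdev(means_n100):.2f}")
```

Both sets of means center on the population mean, but the n=100 means cluster far more tightly around it, even though the population itself is far from normal.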

Hypothesis Testing

Step 1

• Before running any analysis, VISUALIZE!

Mean(x): 9
Mean(y): 7.5
Variance(x): 11
Variance(y): 4.125
Correlation (r): .816
Regression line: y = 3 + .5x

Visualize!
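The summary statistics listed under Step 1 match those of Anscombe's quartet, a classic set of datasets that share nearly identical summaries yet look completely different when plotted. Assuming that is the example the slide had in mind, the sketch below recomputes the numbers for two of the four datasets. The helper `summarize` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet, datasets I and II (both use the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(xs, ys):
    """Return mean(y), variance(y), Pearson's r, and the least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    slope = sxy / sxx
    return {
        "mean_y": my,
        "var_y": variance(ys),
        "r": sxy / sqrt(sxx * syy),
        "intercept": my - slope * mx,
        "slope": slope,
    }

summary_linear = summarize(x, y_linear)
summary_curved = summarize(x, y_curved)
for name, s in [("I", summary_linear), ("II", summary_curved)]:
    print(name, {k: round(v, 2) for k, v in s.items()})
```

Dataset I is a noisy line and dataset II is a smooth curve, yet the summaries agree to about two decimal places. Only a plot reveals the difference.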

Anatomy of a statistical test

• If your change had no effect, what would the world look like?
• This is known as the null hypothesis

[Figure: two panels, "No difference in means" and "No slope in relationship"]

Anatomy of a statistical test

• Given the difference you observed, how likely is a difference at least that large to have occurred by chance alone?
• In this case, we reject the null hypothesis

Probability of seeing a mean difference at least this large, by chance, is 0.012

Probability of seeing a slope at least this large, by chance, is 0.012

Errors

                            Difference exists?
                            Yes                               No
Difference detected?  Yes   True positive                     False positive (Type I error)
                      No    False negative (Type II error)    True negative

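p-value

The probability of seeing a difference at least as large as the one observed, by chance alone, if the null hypothesis were true

The threshold we compare it against, α, is the Type I error rate we are willing to accept: P(Type I error) when the null hypothesis is true

Typically accepted levels for α: 0.05, 0.01, 0.001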

Comparing two populations: counts

Count or occurrence data

• “Fifteen people completed the trial with the control interface, and twenty-two completed it with the augmented interface.”

           Control   Augmented
Success    5         22
Failure    35        18

Pearson’s Chi-Squared Test

See: http://yatani.jp/HCIstats/ChiSquare
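As a rough sketch of what Pearson's chi-squared test computes, the snippet below works through the success/failure counts exactly as they appear in the table (5, 22, 35, 18). It compares each observed count with the count expected if interface and outcome were unrelated. The function name `chi_squared_2x2` is hypothetical, and no continuity correction is applied, so statistical packages that apply one by default will report a somewhat smaller statistic.

```python
from math import erfc, sqrt

# Counts from the table: rows are success/failure,
# columns are control/augmented.
observed = [[5, 22],
            [35, 18]]

def chi_squared_2x2(table):
    """Pearson's chi-squared statistic and p-value for a 2x2 table
    (1 degree of freedom, no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (count - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-squared tail probability
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = chi_squared_2x2(observed)
print(f"chi2(1) = {chi2:.2f}, p = {p_value:.5f}")
```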

Degrees of freedom?

• Sparing the details, DoF can often be considered the number of independent scores (observations) going into a statistic, minus the number of parameters we estimated (typically 1)
• Higher degrees of freedom mean we had more observations relative to the estimate (less uncertainty!)
• What if two tests explain the same variance, but one has higher DoF?

Comparing two populations: means

Normally distributed data

[Figure: a normal distribution annotated with its mean and standard deviation]

t-test: do the means differ?

likely have different means (reject the null hypothesis)

no evidence that the means differ (fail to reject the null hypothesis; we never “accept” it)

t-test: do the means differ?

Numbers that matter:

• Difference in means: larger means more significant
• Variance in each group: larger means less significant
• Number of samples: larger means more significant
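The three numbers above all enter the t statistic directly. The sketch below uses the unequal-variance (Welch) form of the statistic with invented summary numbers. The function name `t_statistic` and every value passed to it are hypothetical.

```python
from math import sqrt

def t_statistic(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """Welch-style t statistic from summary numbers for two groups."""
    standard_error = sqrt(var_a / n_a + var_b / n_b)
    return (mean_a - mean_b) / standard_error

# Arguments: mean_a, mean_b, var_a, var_b, n_a, n_b (all hypothetical).
baseline = t_statistic(10.0, 9.0, 4.0, 4.0, 20, 20)
bigger_difference = t_statistic(11.0, 9.0, 4.0, 4.0, 20, 20)
noisier_groups = t_statistic(10.0, 9.0, 16.0, 16.0, 20, 20)
more_samples = t_statistic(10.0, 9.0, 4.0, 4.0, 80, 80)

print(f"baseline:          t = {baseline:.2f}")
print(f"bigger difference: t = {bigger_difference:.2f}")
print(f"noisier groups:    t = {noisier_groups:.2f}")
print(f"more samples:      t = {more_samples:.2f}")
```

A larger |t| corresponds to a smaller p-value, so a bigger difference or more samples pushes toward significance, while more variance pulls away from it.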

How many degrees of freedom?

• If we know the mean of N numbers, then only N-1 of those numbers can vary, while the last will always be constrained: (observations – estimations)
• We estimate two means, so an independent-samples t-test on N total observations has N-2 degrees of freedom.

Running the test in R

• See: http://yatani.jp/HCIstats/TTest

Reporting the result

• “Experts rated the designs of those in the augmented condition (M = 3.4, SD = 0.4) significantly higher than the designs of those in the control condition (M = 2.0, SD = 0.5), according to an independent samples t-test (t(18) = -2.2, p < .05).”

Within-subjects study designs

• It can be easier to statistically detect a difference if the participants try both alternatives. Why?

• What are the potential issues?

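[Figure: a between-subjects design, where each participant sees either Condition 1 or Condition 2, next to a within-subjects design, where each participant sees both Condition 1 and Condition 2]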

Paired-samples t-test

• If we consider each data point to be independent, then we find no significance (p = .491), because the between-group variance is small relative to the within-group variance
• A paired-samples t-test accounts for differences between individuals, revealing the effect of condition on each (p < .001)
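The contrast between the two analyses can be reproduced with a small made-up dataset. In this sketch the task times are hypothetical (they are not the data behind the p-values quoted above), and `independent_t` and `paired_t` are illustrative helpers that compute only the t statistics.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical task times (seconds) for five participants who tried
# both conditions; participants differ a lot from one another.
condition_1 = [10, 20, 30, 40, 50]
condition_2 = [12, 23, 31, 43, 52]

def independent_t(a, b):
    """t statistic treating every data point as independent."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se

def paired_t(a, b):
    """t statistic computed on per-participant differences."""
    diffs = [y - x for x, y in zip(a, b)]
    se = sqrt(variance(diffs) / len(diffs))
    return mean(diffs) / se

t_independent = independent_t(condition_1, condition_2)
t_paired = paired_t(condition_1, condition_2)
print(f"independent: t = {t_independent:.2f}, paired: t = {t_paired:.2f}")
```

Every participant is slower by a similar small amount in the second condition. The independent analysis drowns that shift in person-to-person variation, while the paired analysis removes it by looking only at each person's own difference.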

Running a paired t-test in R

See: http://yatani.jp/teaching/doku.php?id=hcistats:ttest#a_paired_t_test

ANOVA

t-test: compare two means

• “Do people fix more bugs with our IDE bug suggestion callouts?”

ANOVA: compare N means

• “Do people fix more bugs with our IDE bug suggestion callouts, with warnings, or with nothing?”

Rough intuition for ANOVA test

How much of the total variation can be accounted for by looking at the means of each condition?

total deviation from grand mean = deviation of factor mean from grand mean + deviation of response from factor mean
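The partition of variation behind a one-way ANOVA can be computed by hand. The following is a minimal sketch with hypothetical bug-fix counts for three conditions (eight people each, all numbers invented). It splits the total sum of squares into a between-conditions part and a within-conditions part and forms the F ratio from them.

```python
from statistics import mean

# Hypothetical bug-fix counts for three conditions, 8 people each.
groups = {
    "callouts": [9, 8, 10, 7, 9, 11, 8, 10],
    "warnings": [7, 6, 8, 7, 9, 6, 7, 8],
    "nothing":  [5, 6, 4, 7, 5, 6, 5, 4],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)

# Total variation: every response against the grand mean.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)
# Between-groups variation: each condition mean against the grand mean.
ss_between = sum(len(vs) * (mean(vs) - grand_mean) ** 2
                 for vs in groups.values())
# Within-groups variation: each response against its condition mean.
ss_within = sum((v - mean(vs)) ** 2
                for vs in groups.values() for v in vs)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

The larger the share of total variation captured by the condition means, the larger F becomes. Turning F into a p-value requires the F distribution, which is left to a statistics package.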

Reporting an ANOVA

• “A one-way ANOVA revealed a significant effect of news feed source on number of likes (F(2, 21) = 12.1, p < .001).”

Repeated measures

• Note: when your analysis includes any within-subjects factors, use a repeated measures ANOVA.

Post-hoc tests

Omnibus Prime

• ANOVA is an omnibus test. It compares all levels of all factors.
• When ANOVA is significant, it means “At least one of the means is different.” Which one(s)? By how much?

Pairwise (post-hoc) tests

[Figure: bar chart of mean likes (axis from 0 to 90) for Friend feed, Stranger feed, and Michael feed]

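The problem with many tests

• α = .05 implies a .95 probability of avoiding a false positive on a single test when the null hypothesis is true
• If we do m independent tests, the probability of avoiding a false positive on all of them is now: (1 − α)^m = .95^m
• The probability of at least one false positive, 1 − (1 − α)^m, is called the family-wise error rate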
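To see how quickly the family-wise error rate grows, the short sketch below evaluates 1 − (1 − α)^m for a few values of m. It assumes the m tests are independent, and `family_wise_error_rate` is a hypothetical helper name.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive across m independent
    tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 3, 4, 10, 20):
    print(f"{m:2d} tests at alpha = .05 -> FWER = "
          f"{family_wise_error_rate(0.05, m):.3f}")
```

With only four tests the chance of at least one spurious "significant" result is already above 18%.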

Bonferroni correction

• Correct for family-wise error by adjusting α to be more conservative
• Divide α by the number of comparisons you make
• 4 tests at α = .05 implies using α = .0125
• Conservative but safe method of compensating for multiple tests
• Note: you lose power when conducting lots of tests, so be judicious and plan comparisons via hypotheses!
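Applied to a set of results, the correction is a one-line change to the threshold. The sketch below uses four made-up p-values, and `bonferroni_significant` is a hypothetical helper, not a library function.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the per-test threshold and which tests pass it."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# Hypothetical p-values from four planned comparisons.
p_values = [0.003, 0.020, 0.012, 0.300]
threshold, decisions = bonferroni_significant(p_values)

for p, significant in zip(p_values, decisions):
    print(f"p = {p:.3f} -> {'significant' if significant else 'not significant'}")
```

Note that p = .020 would have passed an uncorrected .05 threshold but does not survive the corrected one, which is the loss of power the slide warns about.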

Tukey test

• Less conservative than Bonferroni
• Compares all pairs of factor level means

Factorial ANOVA

Crossed study designs

• Suppose you wanted to measure whether a drug works for two types of headaches. You have two factors:
  • Treatment vs. placebo
  • Migraine vs. tension headache
• This is a 2 x 2 study: each factor has two levels
• A factorial ANOVA can serve as an omnibus in this case

Interaction effects

• The study reveals two main effects:
  • Those in the treatment condition tend to have less pain than those in the placebo condition
  • Those with tension headaches tend to have less pain than those with migraines
• There is also an interaction effect:
  • Those who had tension headaches had a larger effect from the treatment than those with migraines

[Figure: pain (0 to 10) by treatment condition (Treatment, Placebo), with separate lines for Migraine and Tension]
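Main effects and interactions can be read off the four cell means of a 2 x 2 design. The cell means below are hypothetical values chosen to mimic the pattern described above (they are not read from the slide's chart), and `treatment_effect` is an illustrative helper.

```python
# Hypothetical mean pain ratings (0-10 scale) for each cell of the design.
cell_means = {
    ("treatment", "migraine"): 6.0,
    ("placebo", "migraine"): 8.0,
    ("treatment", "tension"): 2.0,
    ("placebo", "tension"): 7.0,
}

def treatment_effect(headache):
    """Pain reduction from treatment for one headache type."""
    return (cell_means[("placebo", headache)]
            - cell_means[("treatment", headache)])

effect_migraine = treatment_effect("migraine")
effect_tension = treatment_effect("tension")

# Main effect of treatment: the treatment effect averaged over headache types.
main_effect_treatment = (effect_migraine + effect_tension) / 2
# Interaction: how much the treatment effect differs between headache types.
interaction = effect_tension - effect_migraine

print(f"treatment effect, migraine: {effect_migraine}")
print(f"treatment effect, tension:  {effect_tension}")
print(f"main effect of treatment:   {main_effect_treatment}")
print(f"interaction contrast:       {interaction}")
```

A nonzero interaction contrast means the lines in an interaction plot are not parallel. Whether it is statistically significant is what the factorial ANOVA tests.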

Two-factor ANOVA test

Both main effects and interaction significant!

Nonparametric tests

Parametric assumptions

• To use the tests we’ve seen thus far, three criteria must be met:
  • Independence – each subject is sampled independently of every other subject, and measures from one subject are independent of measures on any other subject
  • Normality – data is normally distributed (technically, error terms are normally distributed)
  • Homoskedasticity – the variance is similar across all levels of factors

Parametric assumptions

• Non-parametric tests relax the distributional assumptions (they still require independent observations). Use them for cases like:
  • Rankings or other ordinal data
  • Non-uniform variance

Equivalent nonparametric tests

Parametric                  Nonparametric
Unpaired t-test             Mann-Whitney U
Paired t-test               Wilcoxon matched pairs
ANOVA                       Kruskal-Wallis
Repeated measures ANOVA     Friedman test
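To give a feel for how a rank-based test sidesteps the normality assumption, here is a minimal sketch of the Mann-Whitney U statistic. It only counts which group wins each pairwise comparison, so it uses the order of the values and nothing else. The ratings are hypothetical, `mann_whitney_u` is an illustrative helper, and converting U into a p-value is left to a statistics package.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: the number of (a, b) pairs in which the
    value from a is larger, counting ties as half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical 1-7 Likert ratings for two interfaces.
augmented = [6, 7, 5, 6, 7, 4]
control = [3, 4, 5, 2, 4, 3]

u_augmented = mann_whitney_u(augmented, control)
u_control = mann_whitney_u(control, augmented)
print(f"U(augmented) = {u_augmented}, U(control) = {u_control}")
```

The two U values always sum to the number of pairs (here 6 x 6 = 36). The more lopsided the split, the stronger the evidence that one group tends to receive higher ratings.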

Effect size

• Significance tests tell us whether a difference between groups is unlikely to be due to chance alone, but they don’t tell us the magnitude of that difference.
• Because any nonzero difference will become “significant” with an arbitrarily large sample, it’s important to quantify the effect size that you observe
• We report either standardized or unstandardized effect sizes. When would you use each?

Effect size

• Some common measures of effect size:
  • Unstandardized:
    • The raw difference between means
    • The raw regression coefficient
  • Standardized:
    • Pearson’s r (for correlations)
    • Cohen’s d (for differences between means)
    • η² (the variance explained by one factor, controlling for all others)
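As an illustration of the two kinds of measure, the sketch below computes the raw difference and Cohen's d from the summary statistics in the example t-test report earlier in this primer (3.4 with SD 0.4 versus 2.0 with SD 0.5). It assumes equal group sizes, which the report does not state, and `cohens_d` is a hypothetical helper.

```python
from math import sqrt

def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using the pooled standard deviation,
    assuming both groups are the same size."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd

# Summary statistics from the example report:
# augmented (3.4, SD 0.4) versus control (2.0, SD 0.5).
unstandardized = 3.4 - 2.0
d = cohens_d(3.4, 0.4, 2.0, 0.5)
print(f"raw difference = {unstandardized:.1f}, Cohen's d = {d:.2f}")
```

The raw difference is expressed in rating points, which is easy to interpret when the scale is meaningful. Cohen's d expresses the same gap in standard deviations, which makes it comparable across studies that use different scales.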

A reference for analyses

• DO NOT blindly apply these methods – know what your analysis is considering, and when in doubt ask for assistance!

Tests of Proportions (e.g. counts)

Samples (i.e. factors)   Response categories   Tests
1                        2                     One-sample χ² test, binomial test
1                        >2                    One-sample χ² test, multinomial test
>1                       ≥2                    N-sample χ² test, G-test, Fisher's exact test

Analyses of Variance (comparing means)

Factors   Levels   (B)etween or (W)ithin   Parametric tests                                                    Nonparametric tests
1         2        B                       Independent-samples t-test; Welch's t-test (if non-homoscedastic)   Mann-Whitney U Test
1         >2       B                       One-way ANOVA                                                       Kruskal-Wallis Test
1         2        W                       Paired-samples t-test                                               Wilcoxon Signed-Rank Test
1         >2       W                       One-way repeated measures ANOVA                                     Friedman Test
>1        ≥2       B                       Factorial ANOVA; Linear Models                                      Aligned Rank Transform (ART); Generalized Linear Models (GLM)
>1        ≥2       W                       Factorial repeated measures ANOVA; Linear Mixed Models (LMM)        Aligned Rank Transform (ART); Generalized Linear Mixed Models (GLMM)

Summary

• Our goal is to make inferences about population characteristics from a sample
• Plan your study with your methods in mind
• Always visualize data first
• Check parametric assumptions
• Run omnibus, and if called-for, post-hoc tests
• Correct your family-wise error rate as necessary
• Report test statistic, DoF, p-value, and effect size

Resources

• Statistics drop-in office hours: https://statistics.stanford.edu/resources/consulting
• Our office hours
• MOOCs: https://www.coursera.org/learn/designexperiments