SFU.ca - Simon Fraser Universityjackd/Stat203_2011/Wk13_3_Full.pdf · Author: Jack Created Date:...


Page 1: SFU.ca - Simon Fraser Universityjackd/Stat203_2011/Wk13_3_Full.pdf · Author: Jack Created Date: 8/2/2012 9:19:35 PM

This time: ANOVA examples, course wrap-up.


ANOVA/Review example: We have a zombie outbreak. We’ve

identified the cause: the nefarious Bayes Virus.

So far, we have five possible treatments we’re testing by

administering them to 40 petri dishes of infected blood each in

the hopes that some or all of the treatments are effective in

stopping someone from becoming a zombie.

First thing, we want to know: Is there a significant difference in

the number of viruses (called viral load in epidemiology) per

dish between the treatments?


First, do a scatterplot of the number of viruses over each of the

treatments.


There are a lot of points, and they overlap, which makes the

scatterplot hard to read. We’ll also do a side-by-side boxplot

so we can see the median and related statistics.


Potential issue: The number of viruses is much more scattered

for treatment 5. The box is larger; so are the whiskers.

This is a problem because if one group is a lot more scattered

than the others, we can’t make the assumption of equal

variance. (Can’t use pooled standard deviation)


Another potential issue is the positive skew.

The upper whiskers are longer than the lower whiskers, and all

the outliers are on the upper side.


ANOVA assumes each group is (approximately) normal, but the

normal distribution is symmetric, not skewed.

If the distribution in each group is far from normal, this could

also make ANOVA inaccurate. (Small and moderate breaks

from normality won’t cause problems)


The skew is also apparent in the histogram of each group.


To use ANOVA, we need to make the equal variance

assumption; we need to pool the standard deviation.

Let’s try to find patterns related to this spread.

(From Analyze > Descriptive Stats > Explore)

Treatment   Mean     Std. Dev.   Median   IQR
A           18.85    9.71        17.50    14
B           27.15    14.19       23.50    11
C           34.70    17.91       26       29
D           62.10    27.92       57       31
E           149.82   78.19       129      77

The mean is larger than the median in every case, more

evidence of a positive skew.


The treatments with higher mean virus counts also have higher

standard deviations.

The same trend holds, to a lesser degree, between the median and

the IQR (interquartile range).
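One quick way to see this pattern is to divide each standard deviation by its mean. The sketch below is in Python rather than the SPSS used in the course, it uses only the summary numbers from the table, and the variable names are made up for illustration.

```python
# Summary statistics copied from the table (raw virus counts).
summary = {
    "A": {"mean": 18.85, "sd": 9.71},
    "B": {"mean": 27.15, "sd": 14.19},
    "C": {"mean": 34.70, "sd": 17.91},
    "D": {"mean": 62.10, "sd": 27.92},
    "E": {"mean": 149.82, "sd": 78.19},
}

# Ratio of standard deviation to mean (the coefficient of variation).
cv = {t: s["sd"] / s["mean"] for t, s in summary.items()}

# Ratio of the largest SD to the smallest SD across the five groups.
sds = [s["sd"] for s in summary.values()]
sd_ratio = max(sds) / min(sds)

for t, value in cv.items():
    print(f"Treatment {t}: SD / mean = {value:.2f}")
print(f"Largest SD / smallest SD = {sd_ratio:.1f}")
```

The SD / mean ratio stays between roughly 0.45 and 0.52 for every treatment, while the SDs themselves differ by a factor of about 8. That is what you would expect if the spread grows in proportion to the level.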


Why might this be?

Consider: what’s a more meaningful difference?

1 virus turning into 2, or…

10,000 viruses turning into 10,001?

Viruses multiply, so what really matters is by what factor

they’ve multiplied, not how many have been added to the

group.

A difference of 1 virus means more when there are fewer of

them.


We’ve run into this problem before when looking at the

correlation between GDP/Capita (money earned per person)

and life expectancy.

Some countries were 100 times richer than others.


Also, in terms of health, money is a lot like the viruses.

One dollar means a lot more when you’re only making $2-3 a

day than if you live in a wealthy country.


The problem was scaling. We needed a lens through which to

see very small amounts of money and very large amounts.

That was the log transform.


Without a transform, one of the requirements of a correlation

wasn't met because we had a non-linear relationship.

Without the log transform, we still found a correlation, but it

was weaker than it should have been. We have a similar

problem if we neglect to log-transform our zombie virus data.
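Here is a small sketch of what the log lens does to the two virus changes mentioned earlier (1 to 2 versus 10,000 to 10,001). It is in Python rather than SPSS, and the helper name log10_gain is made up for illustration.

```python
import math


def log10_gain(before, after):
    """Change in log10(count) when a count goes from `before` to `after`."""
    return math.log10(after) - math.log10(before)


small = log10_gain(1, 2)                # 1 virus becomes 2
large = log10_gain(10_000, 10_001)      # 10,000 viruses become 10,001
doubling = log10_gain(10_000, 20_000)   # 10,000 viruses double

print(f"1 -> 2:           {small:.5f}")
print(f"10,000 -> 10,001: {large:.5f}")
print(f"10,000 -> 20,000: {doubling:.5f}")
```

On the log10 scale, going from 1 to 2 is a change of about 0.30, the same as any other doubling, while going from 10,000 to 10,001 barely registers. Equal factors become equal steps.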


The fairy godmother tried to transform a pumpkin into a royal

coach, but her wand was broken. bibbidi-bobbidi-beardie


Let’s look at the log transform of the number of viruses.

If you’re trying this analysis at home, just use the log10virus

variable instead of virus. First, the scatterplot.


Then the side-by-side boxplot.

The IQRs are much closer in size than before. Aside from

treatment B, they’re all about the same.


The skew is gone too: the whiskers are the same length and

the outliers appear on both the upper and lower ends.

In other words, the data in these groups look symmetric.


Finally the summary stats.

Log-Transformed Data

Treatment   Mean   Std. Dev.   Median   IQR
A           1.22   0.22        1.24     0.37
B           1.38   0.22        1.37     0.20
C           1.49   0.22        1.42     0.40
D           1.76   0.18        1.76     0.24
E           2.12   0.21        2.11     0.25

The mean increases from treatment to treatment, but the

standard deviation does not; it is very similar for all five

treatments. We can use pooled standard deviation now.
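As a rough check on that claim, the sketch below pools the five standard deviations from the log-transformed table. It is Python rather than SPSS, it assumes 40 dishes in every treatment group, and it works from the rounded table values, so the result is approximate.

```python
import math

# Standard deviations of log10(virus count), copied from the table.
log_sds = {"A": 0.22, "B": 0.22, "C": 0.22, "D": 0.18, "E": 0.21}
n_per_group = 40  # assumption: 40 dishes in each treatment group

# Pooled variance: average of the group variances, weighted by (n - 1).
numerator = sum((n_per_group - 1) * sd ** 2 for sd in log_sds.values())
denominator = sum(n_per_group - 1 for _ in log_sds)
pooled_sd = math.sqrt(numerator / denominator)

# How far apart are the most and least spread-out groups?
sd_ratio = max(log_sds.values()) / min(log_sds.values())

print(f"Pooled SD = {pooled_sd:.3f}")
print(f"Largest SD / smallest SD = {sd_ratio:.2f}")
```

With these rounded values the pooled SD comes out near 0.21, and the largest SD is only about 1.2 times the smallest, which is why pooling looks reasonable here.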


Also, the mean and median are very similar in most cases now.

There’s no trend of mean > median.

These groups are close enough to symmetrical that we’ll

assume normality.


Now we can do an ANOVA and have confidence in the results.

The p-value is very small, so we have strong evidence that there is

some difference between the means.

With N = 200, even small differences between the means will

be detected with a small p-value. That’s from having such a

large sample.


Also, 20.128 / 28.922 = 0.696

(between-groups sum of squares / total sum of squares).

So 69.6% of the variation in the number of viruses

(specifically, in log(number of viruses))

can be explained by the different treatments.

This leaves 30.4% to unknown factors, like any good zombie

movie.
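The arithmetic behind that percentage can be sketched in a few lines of Python. This assumes 20.128 is the between-groups sum of squares and 28.922 is the total sum of squares, with 5 treatments and 200 dishes in total. The F statistic is computed here from those numbers, not copied from the SPSS output.

```python
# Sums of squares taken from the slide's division (an interpretation).
ss_between = 20.128
ss_total = 28.922
ss_within = ss_total - ss_between

k = 5    # number of treatments
n = 200  # total number of dishes
df_between = k - 1
df_within = n - k

# Proportion of variation explained by treatment.
r_squared = ss_between / ss_total

# Mean squares and the F statistic for a one-way ANOVA.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"Proportion explained = {r_squared:.3f}")
print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
```

An F statistic above 100 on 4 and 195 degrees of freedom is consistent with the very small p-value reported.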


With statistics, even a zombie epidemic feels like a stroll.


Different analysis - Same results.

Consider the following data set and question.

We have a collection of 17 trucks and 14 cars, and we've tracked

the distance each is driven per workday, in km.

We want to know if the amount driven per day is different

between the two types of vehicles.

Response: km driven (interval)

Explanatory: type of vehicle (car or truck)


There are only two means, so we could do a two-sample t-test.

There is no pairing structure, so this is an independent samples

test.

Also, by looking at the scatterplot, it appears that pooled

variance is reasonable.


We run the independent samples t-test.

First, the Levene test has a large p-value (greater than 0.05), so

we fail to reject the hypothesis of equal variance.

With equal variance, we can pool the standard deviation. That

means we use the top row of the output (equal variances assumed).


Against the null that the means of both groups are equal, we

have a p-value of 0.483. At alpha = 0.05, we fail to reject this

null.

The means are not significantly different.

Also, the confidence interval includes zero, so a zero difference

is feasible (this also means we fail to reject the null).


We could have also used the t-score of 0.711 and the degrees of

freedom of 29 (17 + 14 – 2) to test, using the t-table, whether there

was a difference between the two means (car driving distance and

truck driving distance).

(At alpha = 0.05, the two-tailed t-critical value at df = 29 is 2.045,

and 0.711 < 2.045. Since t < t-critical, we fail to reject.)


We could also have done an ANOVA, which works for 2 or

more groups.

Here we’re testing the null hypothesis that all the population

means are the same.

There are only two means, so really we’re testing if those two

are the same.


Since we’re essentially testing the same thing under the same

assumptions as an independent t-test with pooled variance,

we get the same p-value of 0.483, and the same degrees of

freedom of 29.

For interest: The F-stat for two groups is the t-stat squared.
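The F = t squared relationship can be checked by hand. The sketch below is in Python rather than SPSS and uses a tiny invented data set (the km figures are hypothetical, not the course data). The two helper functions are written out from the textbook formulas for the pooled t statistic and the one-way ANOVA F statistic.

```python
from statistics import mean, variance

# Hypothetical km-per-day figures, invented for illustration only.
trucks = [52.0, 61.0, 47.0, 58.0, 66.0, 49.0]
cars = [45.0, 57.0, 50.0, 62.0, 41.0]


def pooled_t(x, y):
    """Two-sample t statistic using the pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (pooled_var * (1 / nx + 1 / ny)) ** 0.5


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


t_stat = pooled_t(trucks, cars)
f_stat = anova_f([trucks, cars])

print(f"t = {t_stat:.4f}, t squared = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"t squared for the lecture's t of 0.711: {0.711 ** 2:.3f}")
```

Whatever two groups you feed in, t squared and F agree up to rounding error. For the t-score of 0.711 in the vehicle example, the matching F would be about 0.506.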


This concludes the course material for Stat 203.


Mission Accomplished?

“My hope is at the end of the semester you are…

- Less intimidated by stats than at the beginning of the

semester.

- Able to handle the most common kinds of statistical

problems, and know what kinds of questions to ask of a

specialist when something more complex comes up.

- 3 credits wiser.”

Lecture 1-1


Where can you go with Stat 203?

ARCH 376, POL 315, and SA 355 all require Stat 203 (or

equivalent).

It’s also recommended for Criminology Honours.

From my limited experience with these courses, the program

JMP is used more than SPSS, but some knowledge will carry

over.


If you end up using SPSS a lot in future work, there is a

certificate you can get through IBM.

By doing your assignments, you have 20-30% of the level 1

certification already.

http://www-03.ibm.com/certify/certs/47100101.shtml

I’m afraid that’s all I know as I don’t have the certificate myself.

If you REALLY enjoyed this course and statistics in general, the

first course for a minor in Stats is Stat 270 – Probability.


However, it’s quite different from this course and is a LOT

more mathematical (calculus is a prerequisite).

Unfortunately, Stat 203 doesn't count for credit towards a

minor in stats either.

For everyone else, who isn’t taking another stats course:


I hope this serves you when the need to handle data does

come up:

A large portion of new research papers in psych/crimin/sociology

or health sciences use data in some form, and often the

analysis is at or near the level that we covered in this course.

At the very least, it’s one more requirement out of the way.

Practice session: West Mall Centre 3260, 10am-noon Tuesday.


Feel free to leave early / drop in late.

Final exam: B9201, 3:30-6:30pm Thor’s day.

West Mall is roughly across from the gym. It’s the building with

Tim Hortons.

B9201 is one of the lecture theatres just off from the AQ main

floor (Same floor as the main exit to this room).


BLUE: You are here (AQ 3181)

RED: Practice session (WMC 3260)

BLACK: Final exam (B 9201)


Recommended reading for after the final (what I wish the

textbook was):

Outliers by Malcolm Gladwell.

Freakonomics, SuperFreakonomics by Levitt and Dubner.

Predictably Irrational by Dan Ariely

The Numerati by Stephen Baker

Moneyball by Michael Lewis