
Statistical methods for Data Science, Lecture 5: Interval estimates; comparing systems

Richard Johansson

November 18, 2018


statistical inference: overview

- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine some interval that is very likely to contain the true value of the parameter (today):
  - interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?


“recipes”

- in this lecture, we'll look at a few "recipes" that you'll use in the assignment:
  - interval estimate for a proportion ("heads probability")
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we'll see the standard method to compute an interval estimate for the mean of a normal
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of experiment? what assumptions about the data?


overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


interval estimates

- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] such that the true value of the parameter is "likely" to be contained in I
- for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]


frequentists and Bayesians again. . .

- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%


interval estimates: overview

- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method


the distribution of our estimator

- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution
- this distribution depends on the sample size
  - large sample → more concentrated distribution

[figure: sampling distribution of the estimator, n = 25]


estimator distribution and sample size (p = 0.35)

[figure: sampling distributions of the estimator for n = 10, 25, 50, and 100; larger samples give more concentrated distributions]


confidence and credible intervals for the proportion parameter

- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating a binomial with a normal
- instead, we'll consider a method to compute a Bayesian credible interval that does not use any approximations
  - works fine even if the numbers are small


credible intervals in Bayesian statistics

1. choose a prior distribution

[figure: prior density over the parameter]

2. compute a posterior distribution from the prior and the data

[figure: posterior density over the parameter]

3. select an interval that covers e.g. 95% of the posterior distribution

[figure: posterior density with a 95% interval marked]


recipe 1: credible interval for the estimation of a probability

- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)

[figure: uniform Beta(1, 1) prior density]

- then the posterior is also a Beta, with parameters k + a and (n − k) + b

[figure: posterior Beta density]

- select a 95% interval

[figure: posterior Beta density with a 95% interval marked]


in Scipy

- assume n_success successes out of n
- recall that we use ppf to get the percentiles!
- or even simpler, use interval

from scipy import stats

# uniform Beta(1, 1) prior
a = 1
b = a

# posterior: Beta(n_success + a, n_fail + b)
n_fail = n - n_success
posterior_distr = stats.beta(n_success + a, n_fail + b)

# equal-tailed 95% credible interval
p_low, p_high = posterior_distr.interval(0.95)
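For reference, a minimal sketch of the ppf route mentioned in the bullets above, which gives the same equal-tailed interval:

# 2.5th and 97.5th percentiles of the posterior
p_low = posterior_distr.ppf(0.025)
p_high = posterior_distr.ppf(0.975)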


example: political polling

- we ask 87 randomly selected Gothenburgers whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857–0.967

from scipy import stats

n_for = 81
n = 87
n_against = n - n_for

p_mle = n_for / n

# uniform prior: a = b = 1
posterior_distr = stats.beta(n_for + 1, n_against + 1)

print('ML / MAP estimate:', p_mle)
print('95% credible interval:', posterior_distr.interval(0.95))


don’t forget your common sense

- I ask 14 Applied Data Science students whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes
- will I get a good estimate?


recipe 2: mean of a normal

- we have a sample that we assume follows some normal distribution; we don't know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
  - this is a bell-shaped distribution with longer tails than the normal

[figure: t distribution density, bell-shaped with longer tails than the normal]

- the t distribution has a parameter called degrees of freedom (df) that controls the tails



recipe 2: mean of a normal (continued)

- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

[figure: t distribution centered at the sample mean]

- to get an interval estimate, select a 95% interval in this distribution

[figure: t density with a 95% interval marked]


example

- to demonstrate, we generate some data:

import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

mu_mle = x.mean()
s = x.std()
n = len(x)

posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))

print('estimate:', mu_mle)
print('95% credible interval:', posterior_distr.interval(0.95))


alternative: estimation using bayes_mvs

- SciPy has a built-in function for the estimation of mean, variance, and standard deviation: https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html
- 95% credible intervals for the mean and the std:

res_mean, _, res_std = stats.bayes_mvs(x, 0.95)

# each result holds a point estimate and an interval
mu_est, (mu_low, mu_high) = res_mean
sigma_est, (sigma_low, sigma_high) = res_std


recipe 3 (if we have time): brute force

- what if we have no clue about how our measurements are distributed?
  - word error rate for speech recognition
  - BLEU for machine translation


the brute-force solution to interval estimates

- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can't be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly from the original dataset



bootstrapping a confidence interval, pseudocode

- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up (a runnable sketch follows after this slide)

repeat N times:
    D* = pick k items randomly (with replacement) from D
    m = estimate on D*
    store m in a list M
return 2.5% and 97.5% percentiles of M

[figure: histogram of bootstrap estimates, with most of the mass between the 2.5% and 97.5% percentiles]

I see Wikipedia for different varieties
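A minimal Python sketch of the pseudocode above (the names data, estimate, and bootstrap_interval are assumptions, not from the slides):

import numpy as np

def bootstrap_interval(data, estimate, n_resamples=10_000, level=0.95):
    # percentile bootstrap: resample with replacement, re-estimate,
    # and take the percentiles of the resulting estimates
    rng = np.random.default_rng()
    k = len(data)
    stats_ = [estimate(rng.choice(data, size=k, replace=True))
              for _ in range(n_resamples)]
    alpha = 100 * (1 - level) / 2
    return np.percentile(stats_, [alpha, 100 - alpha])

# example: 95% interval for the mean of some normal data
data = np.random.normal(loc=3, scale=0.5, size=500)
print(bootstrap_interval(data, np.mean))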


overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


statistical significance testing for the accuracy

- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we'll see recipes that can be used in these two scenarios
  - these recipes work when we can assume that the "tests" (e.g. documents) are independent
  - for tests in general, see e.g. Wikipedia


comparing the accuracy to some given value

- my boss has told me to build a classifier with an accuracy of at least 0.70
- my NB classifier made 40 correct predictions out of 50
  - so the MLE of the accuracy is 0.80
- based on this experiment, how certain can I be that the accuracy is really different from 0.70?
- if the true accuracy is 0.70, how unusual is our outcome?


null hypothesis significance tests (NHST)

- we assume a null hypothesis and then see how unusual (extreme) our outcome is
  - the null hypothesis is typically "boring": the true accuracy is equal to 0.7
  - the "unusualness" is measured by the p-value
- if the null hypothesis is true, how likely are we to see an outcome as unusual as the one we got?
- the traditional threshold for p-values to be considered "significant" is 0.05


the exact binomial test

- the exact binomial test is used when comparing an estimated probability/proportion (e.g. the accuracy) to some fixed value
  - 40 correct guesses out of 50
  - is the true accuracy really different from 0.70?
- if the null hypothesis is true, then this experiment corresponds to a binomially distributed r.v. with parameters 50 and 0.70
- we compute the p-value as the probability of getting an outcome at least as unusual as 40


historical side note: sex ratio at birth

- the first known case where a p-value was computed involved the investigation of sex ratios at birth in London in 1710
- null hypothesis: P(boy) = P(girl) = 0.5
- result: p close to 0 (significantly more boys)

"From whence it follows, that it is Art, not Chance, that governs."
(Arbuthnot, An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes, 1710)


example

- 40 correct guesses out of 50
- if the true accuracy is 0.70, is 40 out of 50 an unusual result?

[figure: Binomial(50, 0.70) pmf over the possible outcomes, with the outcome 40 marked]

- the p-value is 0.16, which isn't "significantly" unusual!



implementing the exact binomial test in Scipy

- assume we made x correct guesses out of n
- is the accuracy significantly different from test_acc?
- the p-value is the sum of the probabilities of the outcomes that are at least as "unusual" as x:

import scipy.stats

def exact_binom_test(x, n, test_acc):
    # distribution of the number of correct guesses under the null hypothesis
    rv = scipy.stats.binom(n, test_acc)
    p_x = rv.pmf(x)
    # sum the probabilities of all outcomes at most as probable as x
    p_value = 0
    for i in range(0, n + 1):
        p_i = rv.pmf(i)
        if p_i <= p_x:
            p_value += p_i
    return p_value

- actually, we don't have to implement it, since there is a function scipy.stats.binom_test that does exactly this!
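As a quick check on the running example (not on the slides), both versions agree; note that recent SciPy versions replace binom_test with scipy.stats.binomtest, which returns an object with a pvalue attribute:

print(exact_binom_test(40, 50, 0.70))        # ~0.16, as on the earlier slide
print(scipy.stats.binom_test(40, 50, 0.70))  # built-in, same value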


overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


comparing two classifiers

- I'm comparing a Naive Bayes and a perceptron classifier
- we evaluate them on the same test set
- the NB classifier had 186 correct out of 312 guesses
- ... and the perceptron had 164 correct guesses
- so the ML estimates of the accuracies are 0.60 and 0.53, respectively
- but does this strongly support that the NB classifier is really better?


contingency table

- we make a table that compares the errors of the two classifiers:

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- if NB is about as good as the perceptron, the B and C values should be similar
  - conversely, if they are really different, B and C should differ
- are these B and C values unusual?


McNemar’s test

- in McNemar's test, we model the discrepancies (the B and C values)

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- there are a number of variants of this test
- the original formulation:
  Quinn McNemar (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12:153–157.
- our version builds on the exact binomial test that we saw before


McNemar’s test (continued)

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- the number of discrepancies is B + C
- how are the discrepancies distributed?
  - if the two systems are equivalent, the discrepancies should be more or less evenly spread into the B and C boxes
  - it can be shown that B would be a binomial random variable with parameters B + C and 0.5
- so we can find the p-value (the "unusualness") like this:

  p_value = scipy.stats.binom_test(B, B + C, 0.5)

- in this case it is 0.035, supporting the claim that NB is better
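Spelled out with the counts from the table above (a minimal sketch; again, newer SciPy calls this scipy.stats.binomtest):

import scipy.stats

# discrepancy counts: B = perceptron correct but NB incorrect; C = the reverse
B, C = 39, 61
p_value = scipy.stats.binom_test(B, B + C, 0.5)
print(p_value)  # ~0.035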


alternative implementation

http://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html


overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


searching for significant effects

- scientific investigations sometimes operate according to the following procedure:
  1. propose some hypothesis
  2. collect some data
  3. do we get a "significant" p-value over some null hypothesis?
  4. if no, revise the hypothesis and go back to 3.
  5. if yes, publish your findings, promote them in the media, ...


searching for significant effects (alternative)

- or a "data science" experiment:
  1. you are given some dataset and told to "extract some meaning" from it
  2. look at the data until you find a "significant" effect
  3. publish ...


searching for significant effects

- remember: if the null hypothesis is true, we will still see "significant" effects about 5% of the time
- consequence: if we search long enough, we will probably find some effect with a p-value that is small
  - even if this is just due to chance
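A small simulation (not from the slides) illustrating the first point: when the null hypothesis is true, tests at the 0.05 level still reject about 5% of the time:

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
n_experiments, n_trials = 1000, 100

# every experiment is a fair coin, so the null hypothesis is always true
rejections = 0
for _ in range(n_experiments):
    k = rng.binomial(n_trials, 0.5)
    if scipy.stats.binom_test(k, n_trials, 0.5) < 0.05:
        rejections += 1
print(rejections / n_experiments)  # roughly 0.05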


spurious correlations

[figure from tylervigen.com: "Letters in winning word of Scripps National Spelling Bee correlates with Number of people killed by venomous spiders", 1999–2009]


“data dredging”: further reading

https://en.wikipedia.org/wiki/Data_dredging

https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data


some solutions

- common sense
- held-out data (a separate test set)
- correcting for multiple comparisons


Bonferroni correction for multiple comparisons

- assume we have an experiment where we carry out N comparisons
- in the Bonferroni correction, we multiply the p-values of the individual tests by N (or alternatively, divide the "significance" threshold by N)
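A minimal sketch of the correction, with hypothetical p-values (not from the slides):

N = 3
p_values = [0.010, 0.200, 0.030]  # hypothetical raw p-values from N tests

# multiply each p-value by N, capping at 1
corrected = [min(1.0, p * N) for p in p_values]

# equivalently: compare the raw p-values against a lowered threshold
alpha = 0.05
significant = [p < alpha / N for p in p_values]
print(corrected, significant)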


Bonferroni correction for multiple comparisons: example


the rest of the week

- Wednesday: Naive Bayes and evaluation assignment
- Thursday: probabilistic clustering (Morteza)
- Friday: QA hours (14–16)