A/B Testing Pitfalls and Lessons Learned at Spotify

33
7/2/22 Danielle Jabin A/B Testing: Avoiding Common Pitfalls

description

 

Transcript of A/B Testing Pitfalls and Lessons Learned at Spotify

Page 1: A/B Testing Pitfalls and Lessons Learned at Spotify

April 9, 2023

Danielle Jabin

A/B Testing: Avoiding Common Pitfalls

Page 2: A/B Testing Pitfalls and Lessons Learned at Spotify

Who are you?

Page 3: A/B Testing Pitfalls and Lessons Learned at Spotify

3

Page 4: A/B Testing Pitfalls and Lessons Learned at Spotify

4

Page 5: A/B Testing Pitfalls and Lessons Learned at Spotify

5

Babies named Ava caused the housing bubble

Page 6: A/B Testing Pitfalls and Lessons Learned at Spotify

6

Correlation does not equal causation!

Page 7: A/B Testing Pitfalls and Lessons Learned at Spotify

7

Correlation does not equal causation!

Page 8: A/B Testing Pitfalls and Lessons Learned at Spotify

8

What about in a product context?

Page 9: A/B Testing Pitfalls and Lessons Learned at Spotify

Do your improvements actually affect user behavior or are the variations due to chance?

A/B Testing!

Page 10: A/B Testing Pitfalls and Lessons Learned at Spotify

But…what’s an A/B test?

Page 11: A/B Testing Pitfalls and Lessons Learned at Spotify

11

Page 12: A/B Testing Pitfalls and Lessons Learned at Spotify

Pitfall #1: Not limiting your error rate

Page 13: A/B Testing Pitfalls and Lessons Learned at Spotify

13

Page 14: A/B Testing Pitfalls and Lessons Learned at Spotify

14

What if I flip a coin 100 times, and I get 51 heads?

Page 15: A/B Testing Pitfalls and Lessons Learned at Spotify

15

What if I flip a coin 100 times, and I get 5 heads?

Page 16: A/B Testing Pitfalls and Lessons Learned at Spotify

16

Page 17: A/B Testing Pitfalls and Lessons Learned at Spotify

17

●alpha levels of 5% and 1% are the most common– Alternatively: P(significant) = .05 or .01

Statistical significance: probability that a change is not due to chance alone

Page 18: A/B Testing Pitfalls and Lessons Learned at Spotify

18

●To lock in error rates before you start a test, fix your sample size

Remedies

Page 19: A/B Testing Pitfalls and Lessons Learned at Spotify

19

●To lock in error rates before you start a test, fix your sample size

What should my sample size be?

Sample size in each group (assumes equal sized groups)

Represents the desired power (typically .84 for 80% power).

Represents the desired level of statistical significance (typically 1.96).

Standard deviation of the outcome variable Effect Size (the

difference in means)

2

2/2

2

difference

)Z(2

Zn

Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt

Page 20: A/B Testing Pitfalls and Lessons Learned at Spotify

20

●With σ = 10, difference in means = 1

Let’s see some numbers

One-sided test Two-sided test

alpha = .10, beta = .80 841 1230

alpha = .05, beta = .80 1230 1568

alpha = .01, beta = .80 2010 2339

Page 21: A/B Testing Pitfalls and Lessons Learned at Spotify

Pitfall #2: Stopping your test before the fixed sample size stopping point

Page 22: A/B Testing Pitfalls and Lessons Learned at Spotify

22

●1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion rate

Let’s see some more numbers

Stop at first point of significance

Ended as significant

90% significance reached

654 of 1,000 100 of 1,000

95% significance reached

427 of 1,000 49 of 1,000

99% significance reached

146 of 1,000 14 of 1,000

Page 23: A/B Testing Pitfalls and Lessons Learned at Spotify

23

●Don’t peek●Okay, maybe you can peek, but don’t stop or make a decision before you

reach the fixed sample size stopping point●Sequential sampling

Remedies

Page 24: A/B Testing Pitfalls and Lessons Learned at Spotify

Pitfall #3: Making multiple comparisons in one test

Page 25: A/B Testing Pitfalls and Lessons Learned at Spotify

25

Let’s look at an A/B test from our Discovery feature:

Page 26: A/B Testing Pitfalls and Lessons Learned at Spotify

26

●P(significant) + P(not significant) = 1●Let’s take an alpha of .05

– P(significant) = .05– P(not significant) = 1 – P(significant) = 1 - .05 = .95

A test can be one of two things: significant or not significant

Page 27: A/B Testing Pitfalls and Lessons Learned at Spotify

27

●P(at least 1 significant) = 1 - P(none of the 2 are significant)●P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95

= .9025●P(at least 1 significant) = 1 - .9025 = .0975

What about for two comparisons?

Page 28: A/B Testing Pitfalls and Lessons Learned at Spotify

28

●That’s almost 2x (1.95x, to be precise) your .05 significance rate!

What about for two comparisons?

Page 29: A/B Testing Pitfalls and Lessons Learned at Spotify

29

And it just gets worse…

P(at least 1 signifcant) An increase of…

5 comparisons 1 – (1-.05)^5 = .23 4.6x

10 comparisons 1 – (1-.05)^10 = .40 8x

20 comparisons 1 – (1-.05)^20 = .64 12.8x

Page 30: A/B Testing Pitfalls and Lessons Learned at Spotify

30

●Bonferroni correction– Divide P(significant), your alpha, by the number of alternate hypotheses you

have, n– alpha/n becomes the new level of statistical significance

How can we remedy this?

Page 31: A/B Testing Pitfalls and Lessons Learned at Spotify

31

●Our new P(significant) = .05/2 = .025●Our new P(not significant) = 1 - .025 = .975●P(at least 1 significant) = 1 - P(none of the 2 are significant)●P(none of the 2 are significant) = P(not significant)*P(not significant)

= .975*.975 = .951●P(at least 1 significant) = 1 - .951 = .0499

So what about two comparisons now?

Page 32: A/B Testing Pitfalls and Lessons Learned at Spotify

32

P(significant) stays under .05

Corrected alpha P(at least 1 signifcant)

5 comparisons .05/5 = .01 1 – (1-.01)^5 = .049

10 comparisons .05/10 = .005 1 – (1-.005)^10 = .049

20 comparisons .05/20 = .0025 1 – (1-.0025)^20 = .049

Page 33: A/B Testing Pitfalls and Lessons Learned at Spotify

Questions?