Statistics in 40 Minutes: A/B Testing Fundamentals


Transcript of Statistics in 40 Minutes: A/B Testing Fundamentals

Page 1: Statistics in 40 Minutes: A/B Testing Fundamentals

Statistics in 40 Minutes: A/B Testing Fundamentals
Leo Pekelis, Statistician, Optimizely

@lpekelis · [email protected]

#opticon2015

Page 2: Statistics in 40 Minutes: A/B Testing Fundamentals

You have your own unique approach to A/B Testing

Page 3: Statistics in 40 Minutes: A/B Testing Fundamentals

The goal of this talk is to break down A/B Testing to its fundamentals.

Page 4: Statistics in 40 Minutes: A/B Testing Fundamentals
Page 5: Statistics in 40 Minutes: A/B Testing Fundamentals

A/B Testing Platform

1) Create an experiment
2) Read the results page

Page 6: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 7: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 8: Statistics in 40 Minutes: A/B Testing Fundamentals

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

Page 9: Statistics in 40 Minutes: A/B Testing Fundamentals

The answers

1. A good hypothesis has a variation and a clearly defined goal, which you crafted ahead of time.

2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low signal goals.

3. All three levers are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.

Page 10: Statistics in 40 Minutes: A/B Testing Fundamentals

First, some vocabulary (yay!)

Page 11: Statistics in 40 Minutes: A/B Testing Fundamentals

• Control and Variation: The control is the original, or baseline, version of the content that you are testing against a variation.

• Goal: The metric used to measure the impact of the control and variation.

• Baseline conversion rate: The control group’s expected conversion rate.

• Effect size: The improvement (positive or negative) of your variation over the baseline.

• Sample size: The number of visitors in your test.

Page 12: Statistics in 40 Minutes: A/B Testing Fundamentals

• A hypothesis test is a control and variation pair that you want to show improves a goal.

• An experiment is a collection of hypotheses (goal & variation pairs) that all share the same control.

Page 13: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 14: Statistics in 40 Minutes: A/B Testing Fundamentals

http://www.nba.com/

Imagine we are the NBA

Page 15: Statistics in 40 Minutes: A/B Testing Fundamentals

What is a good hypothesis (test)?

Page 16: Statistics in 40 Minutes: A/B Testing Fundamentals

Bad hypothesis: “I think changing the header image will make my site better.”

Why is this not actionable? It invites test creep:

• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase “the finals” clicks?
• Growing the header will increase engagement?
• Growing the header will increase “the finals” clicks?
• …

Page 17: Statistics in 40 Minutes: A/B Testing Fundamentals

Bad hypothesis: “I think changing the header image will make my site better.”

Good hypotheses, organized and clear:

• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase “the finals” clicks?
• Growing the header will increase engagement?
• Growing the header will increase “the finals” clicks?
• …

Page 18: Statistics in 40 Minutes: A/B Testing Fundamentals

Hypotheses also give the cost of your experiment: the more relationships (hypotheses) you test, the longer (in visitors) it will take to achieve the same outcome (error rate).

Page 19: Statistics in 40 Minutes: A/B Testing Fundamentals

Questions to check for a good hypothesis

What are you trying to show with your idea?

What key metrics should it drive?

Are all my goals and variations necessary given my testing limits?

Page 20: Statistics in 40 Minutes: A/B Testing Fundamentals

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

Page 21: Statistics in 40 Minutes: A/B Testing Fundamentals

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?
Answer: A good hypothesis has a variation and a clearly defined goal, which you crafted ahead of time.

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

Page 22: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 23: Statistics in 40 Minutes: A/B Testing Fundamentals

http://www.nba.com/

Page 24: Statistics in 40 Minutes: A/B Testing Fundamentals

What are the possible outcomes?

Page 25: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

(no effect, winner / loser)

Page 26: Statistics in 40 Minutes: A/B Testing Fundamentals

http://www.nba.com/

Page 27: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

(no effect, winner / loser)
(+/- improvement, inconclusive)

Page 28: Statistics in 40 Minutes: A/B Testing Fundamentals

http://www.nba.com/

Page 29: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

(no effect, winner / loser)
(+/- improvement, inconclusive)
(+/- improvement, winner / loser)
(no effect, inconclusive)

Page 30: Statistics in 40 Minutes: A/B Testing Fundamentals

The 2x2 table will help us to

1. Keep track of different error rates we care about

2. Explore the consequences of controlling false positives vs false

discoveries
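To make the four outcomes concrete, here is a minimal simulation sketch in Python (not Optimizely's Stats Engine; the baseline rate, lift, and sample size are illustrative assumptions): it runs many hypothetical A/B tests with a simple fixed-horizon z-test and tallies where each one lands in the 2x2 table.

```python
# A minimal sketch: simulate many A/B tests and tally the 2x2 outcome table.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, baseline, lift, alpha = 10_000, 0.10, 0.02, 0.05
table = {"true positive": 0, "false positive": 0,
         "false negative": 0, "true negative": 0}

for _ in range(2_000):
    has_effect = rng.random() < 0.5          # half the variations truly improve
    conv_a = rng.binomial(n, baseline)
    conv_b = rng.binomial(n, baseline + (lift if has_effect else 0.0))
    p_pool = (conv_a + conv_b) / (2 * n)     # pooled conversion rate
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (conv_b - conv_a) / n / se
    winner_or_loser = 2 * stats.norm.sf(abs(z)) < alpha
    if has_effect:
        table["true positive" if winner_or_loser else "false negative"] += 1
    else:
        table["false positive" if winner_or_loser else "true negative"] += 1

print(table)
```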

Page 31: Statistics in 40 Minutes: A/B Testing Fundamentals

Error rate 1: False positive rate

Page 32: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

• False positive rate (Type I error)
= “Chance of a false positive from a variation with no effect on a goal”

Page 33: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

• False positive rate (Type I error)
= “Chance of a false positive from a variation with no effect on a goal”

Page 34: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

• False positive rate (Type I error)
= “Chance of a false positive from a variation with no effect on a goal”

Page 35: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

• False positive rate (Type I error)
= “Chance of a false positive from a variation with no effect on a goal”

• Thresholding the FPR
= “When I have a variation with no effect on a goal, I’ll find an effect less than 10% of the time.”

Page 36: Statistics in 40 Minutes: A/B Testing Fundamentals

How can we ever compute a False Positive Rate if we don’t know whether a hypothesis is true or not?

Statistical tests (the fixed-horizon t-test, Stats Engine) are designed to threshold an error rate.

Example: “Calling winners & losers when a p-value is below .05 will guarantee a False Positive Rate below 5%.”
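As a sketch of that example, here is a fixed-horizon two-proportion z-test in Python (a normal approximation in the same spirit as the t-test named above; the visitor and conversion counts are made up): call a winner or loser only when the two-sided p-value falls below .05, which caps the False Positive Rate at 5%.

```python
# A minimal sketch of a fixed-horizon test that thresholds the FPR at 5%.
import math
from scipy import stats

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * stats.norm.sf(abs(z))

# Made-up counts: 10.0% baseline vs. 11.4% variation over 5,000 visitors each
p = two_proportion_p_value(conv_a=500, n_a=5_000, conv_b=570, n_b=5_000)
print("winner/loser" if p < 0.05 else "inconclusive", f"(p = {p:.4f})")
```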

Page 37: Statistics in 40 Minutes: A/B Testing Fundamentals

False Positive Rates with multiple tests

Page 38: Statistics in 40 Minutes: A/B Testing Fundamentals

https://xkcd.com/882/

Page 39: Statistics in 40 Minutes: A/B Testing Fundamentals

https://xkcd.com/882/

Page 40: Statistics in 40 Minutes: A/B Testing Fundamentals

https://xkcd.com/882/

Page 41: Statistics in 40 Minutes: A/B Testing Fundamentals

What happened?

21 tests × 5% FPR ≈ 1 False Positive on average
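The comic's scenario in code, as a minimal sketch: 21 goals with no true effect, each tested at a 5% False Positive Rate. Under the null hypothesis each p-value is uniform on [0, 1], so about one test per experiment comes up "significant" anyway.

```python
# Simulate experiments with 21 truly null tests, each thresholded at 5%.
import numpy as np

rng = np.random.default_rng(1)
experiments, tests, alpha = 10_000, 21, 0.05
p_values = rng.uniform(size=(experiments, tests))   # uniform under the null
false_positives = (p_values < alpha).sum(axis=1)
print("mean false positives per experiment:", false_positives.mean())   # ~1.05
print("share of experiments with >= 1 false positive:",
      (false_positives >= 1).mean())                                    # ~0.66
```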

Page 42: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

False positive rates are only useful in the context of all hypotheses.

Page 43: Statistics in 40 Minutes: A/B Testing Fundamentals

Error rate 2: False discovery rate

Page 44: Statistics in 40 Minutes: A/B Testing Fundamentals

• False discovery rate (FDR)
= “Chance of a false positive from a conclusive result”

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

Page 45: Statistics in 40 Minutes: A/B Testing Fundamentals

• False discovery rate (FDR)
= “Chance of a false positive from a conclusive result”

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

Page 46: Statistics in 40 Minutes: A/B Testing Fundamentals

• False discovery rate (FDR)
= “Chance of a false positive from a conclusive result”

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

Page 47: Statistics in 40 Minutes: A/B Testing Fundamentals

• False discovery rate (FDR)
= “Chance of a false positive from a conclusive result”

• Thresholding the FDR
= “When you see a winning or losing goal on a variation, it’s wrong less than 10% of the time.”

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative
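Stats Engine thresholds the FDR with its own sequential procedure; a classical offline analogue, shown here purely for illustration (this is not Optimizely's method), is the Benjamini-Hochberg procedure, applied to made-up p-values, one per goal & variation pair.

```python
# Benjamini-Hochberg: control the FDR across many hypotheses at once.
import numpy as np

def benjamini_hochberg(p_values, fdr=0.10):
    """Return a boolean mask of discoveries at the given FDR threshold."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    # Find the largest k with p_(k) <= (k/m) * fdr, then
    # declare the k smallest p-values to be discoveries.
    below = p[order] <= (np.arange(1, m + 1) / m) * fdr
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1
        discoveries[order[:k]] = True
    return discoveries

p_values = [0.001, 0.008, 0.039, 0.041, 0.09, 0.35, 0.62]   # made up
print(benjamini_hochberg(p_values, fdr=0.10))   # first four are discoveries
```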

Page 48: Statistics in 40 Minutes: A/B Testing Fundamentals

A winner or loser × 5% FDR = 0.05 False Positives on average

Page 49: Statistics in 40 Minutes: A/B Testing Fundamentals

                     “True” value of hypothesis
Result of test       Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

False discovery rates remain useful regardless of the number of hypotheses.

Page 50: Statistics in 40 Minutes: A/B Testing Fundamentals

What’s the catch?

The more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.

Page 51: Statistics in 40 Minutes: A/B Testing Fundamentals

Not quite …

Page 52: Statistics in 40 Minutes: A/B Testing Fundamentals

[Figure: a low signal goal vs. a high signal goal]

Page 53: Statistics in 40 Minutes: A/B Testing Fundamentals

What’s the catch?

The more low signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.

Page 54: Statistics in 40 Minutes: A/B Testing Fundamentals

Recap

• False Positive Rate thresholding
- controls the chance of a false positive when you have a hypothesis with no effect
- misrepresents your error rate with multiple goals and variations

• False Discovery Rate thresholding
- controls the chance of a false positive when you have a winning or losing hypothesis
- is accurate regardless of how many hypotheses you run
- can take longer to reach significance with more low signal variations on goals

Page 55: Statistics in 40 Minutes: A/B Testing Fundamentals

Tips & Tricks for running experiments with False Discovery Rates

• Ask: Which goal is most important to me?
- This should be my primary goal (not impacted by all other goals)

• Run large tests, or large multivariate tests, without fear of finding spurious results, but be prepared for the cost of exploration

• A little human intuition and prior knowledge can go a long way towards reducing the runtime of your experiments

Page 56: Statistics in 40 Minutes: A/B Testing Fundamentals

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

Page 57: Statistics in 40 Minutes: A/B Testing Fundamentals

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?
Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

Page 58: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 59: Statistics in 40 Minutes: A/B Testing Fundamentals

“3 Levers” of A/B Testing

1. Threshold an error rate
• “I want no more than a 10% false discovery rate”

2. Detect effect sizes (set an MDE)
• “I’m OK with only detecting greater than a 5% improvement”

3. Run tests longer
• “I can afford to run this test for 3 weeks, or 50,000 visitors”

Page 60: Statistics in 40 Minutes: A/B Testing Fundamentals

Fundamental Tradeoff of A/B Testing

[Diagram: the three levers, all inversely related: Error rates, Runtime, Effect size / Baseline CR]

Page 61: Statistics in 40 Minutes: A/B Testing Fundamentals

[Diagram: Error rates, Runtime, Effect size / Baseline CR, all inversely related]

At any number of visitors, the less you threshold your error rate, the smaller the effect sizes you can detect.

Page 62: Statistics in 40 Minutes: A/B Testing Fundamentals

[Diagram: Error rates, Runtime, Effect size / Baseline CR, all inversely related]

At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.

Page 63: Statistics in 40 Minutes: A/B Testing Fundamentals

[Diagram: Error rates, Runtime, Effect size / Baseline CR, all inversely related]

For any effect size, the lower the error rate you want, the longer you need to run your test.

Page 64: Statistics in 40 Minutes: A/B Testing Fundamentals

What does this look like in practice?

Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%; improvement is relative)

Significance threshold   5% improvement   10% improvement   25% improvement
95%                      62,400           13,500            1,800
90%                      59,100           12,800            1,700
80%                      52,600           11,400            1,500
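For intuition about where numbers of this size come from, here is the textbook fixed-horizon sample-size formula for comparing two conversion rates. It is not the Stats Engine calculation, so it will not reproduce the table exactly; the 80% power default is an assumption for illustration.

```python
# Classical fixed-horizon sample-size sketch for a two-proportion test.
from scipy import stats

def visitors_per_arm(baseline, rel_improvement, alpha=0.05, power=0.8):
    """Approximate visitors needed per arm to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + rel_improvement)
    z_a = stats.norm.ppf(1 - alpha / 2)   # significance threshold
    z_b = stats.norm.ppf(power)           # 1 - false negative rate
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var / (p2 - p1) ** 2

# 10% baseline, 5% relative improvement (10% -> 10.5%): ~58,000 per arm
print(round(visitors_per_arm(0.10, 0.05)))
```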

Page 65: Statistics in 40 Minutes: A/B Testing Fundamentals

All A/B testing platforms address the fundamental tradeoff:

1. Choose a minimum detectable effect (MDE) and false positive rate threshold
2. Find the required minimum sample size with a sample size calculator
3. Wait until the minimum sample size is reached
4. Look at your results once, and only once

Page 66: Statistics in 40 Minutes: A/B Testing Fundamentals

Optimizely is the only platform that lets you pull the levers in real time.

Page 67: Statistics in 40 Minutes: A/B Testing Fundamentals

http://www.nba.com/

Page 68: Statistics in 40 Minutes: A/B Testing Fundamentals

In the beginning, we make an educated guess:

Error rate: 5%
Effect size / Baseline CR: +5%, 10%
Runtime: 52,600 visitors

Page 69: Statistics in 40 Minutes: A/B Testing Fundamentals

… but then the improvement turns out to be better:

Error rate: 5%
Effect size / Baseline CR: +13%, 16%
Runtime (remaining): 1,600 visitors, instead of 52,600 - 7,200 = 45,400

Page 70: Statistics in 40 Minutes: A/B Testing Fundamentals

… or a lot worse:

Error rate: 5%
Effect size / Baseline CR: +2%, 8%
Runtime (remaining): > 100,000 visitors

Page 71: Statistics in 40 Minutes: A/B Testing Fundamentals

Recap

• The Fundamental Tradeoff of A/B Testing affects you no matter what testing platform you use.
- If you want to detect a 5% improvement on a 10% baseline conversion rate, you should be prepared to wait for at least 50,000 visitors.

• Optimizely’s Stats Engine is the only platform that allows you to adjust the trade-off in real time while still reporting valid error rates.

Page 72: Statistics in 40 Minutes: A/B Testing Fundamentals

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

Page 73: Statistics in 40 Minutes: A/B Testing Fundamentals

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Answer: All three are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.

Page 74: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 75: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 76: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 77: Statistics in 40 Minutes: A/B Testing Fundamentals
Page 78: Statistics in 40 Minutes: A/B Testing Fundamentals

Statistics in 40 Minutes: A/B Testing Fundamentals
Leo Pekelis, Statistician, Optimizely

@lpekelis · [email protected]

#opticon2015

Page 79: Statistics in 40 Minutes: A/B Testing Fundamentals

“A/B Testing Playbook”

Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals

Page 80: Statistics in 40 Minutes: A/B Testing Fundamentals
Page 81: Statistics in 40 Minutes: A/B Testing Fundamentals

Definition:

A confidence interval is a range of values for your metric (revenue, conversion rate, etc.) that is 90%* likely to contain the true difference between your variation and baseline.
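A minimal sketch of such an interval, assuming a simple normal approximation for the difference in conversion rates (Stats Engine computes its intervals differently); the counts are made up, and 90% matches the level in the definition above.

```python
# Normal-approximation confidence interval for (variation - baseline).
import math
from scipy import stats

def confidence_interval(conv_a, n_a, conv_b, n_b, level=0.90):
    """CI for the difference in conversion rates, variation minus control."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - (1 - level) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = confidence_interval(conv_a=480, n_a=5_000, conv_b=560, n_b=5_000)
print(f"90% CI for improvement: [{low:.4f}, {high:.4f}]")
```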

Page 82: Statistics in 40 Minutes: A/B Testing Fundamentals

[Figure: a confidence interval from 7.29 (worst case) to 15.41 (best case), with 11.4 as the middle ground]

Page 83: Statistics in 40 Minutes: A/B Testing Fundamentals

This is true regardless of your significance.

Page 84: Statistics in 40 Minutes: A/B Testing Fundamentals

http://www.nba.com/

Page 85: Statistics in 40 Minutes: A/B Testing Fundamentals

We can’t wait for significance

Page 86: Statistics in 40 Minutes: A/B Testing Fundamentals

The confidence interval tells us what we need to know

Page 87: Statistics in 40 Minutes: A/B Testing Fundamentals

A confidence interval is the mirror image of statistical significance.

Mathematical definition: the set of parameter values X such that a hypothesis test with null hypothesis

H0: Removing a distracting header will result in X more revenue per visitor.

is not yet rejected.

Page 88: Statistics in 40 Minutes: A/B Testing Fundamentals

Error rate 3: False negative rate

Page 89: Statistics in 40 Minutes: A/B Testing Fundamentals

• False negative rate (Type II error)
= “Rate of false negatives from all hypotheses that could have been false negatives.”

                     “True” value of hypothesis
Outcome of test      Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

Page 90: Statistics in 40 Minutes: A/B Testing Fundamentals

• False negative rate (Type II error)
= “Rate of false negatives from all hypotheses that could have been false negatives.”

                     “True” value of hypothesis
Outcome of test      Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

Page 91: Statistics in 40 Minutes: A/B Testing Fundamentals

• False negative rate (Type II error)
= “Rate of false negatives from all hypotheses that could have been false negatives.”

                     “True” value of hypothesis
Outcome of test      Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative

Page 92: Statistics in 40 Minutes: A/B Testing Fundamentals

• False negative rate (Type II error)
= “Rate of false negatives from all variations with an improvement on a goal.”
= #(False negatives) / #(Improvements)

• Thresholding Type II error
= “When you have a goal on a variation with an effect, you miss it less than 10% of the time.”

                     “True” value of hypothesis
Outcome of test      Improvement        No effect
Winner / Loser       True positive      False positive
Inconclusive         False negative     True negative
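A minimal simulation sketch of the false negative rate, with the same illustrative assumptions as the earlier examples (fixed-horizon z-test, made-up rates and sample size): test many variations that truly improve a 10% baseline by a relative 5% and count how often the test comes back inconclusive anyway.

```python
# Estimate the false negative rate empirically: misses / true improvements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p_a, p_b, alpha = 60_000, 0.10, 0.105, 0.05   # every variation truly improves
runs, misses = 2_000, 0
for _ in range(runs):
    conv_a = rng.binomial(n, p_a)
    conv_b = rng.binomial(n, p_b)
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (conv_b - conv_a) / n / se
    if 2 * stats.norm.sf(abs(z)) >= alpha:   # inconclusive despite a real effect
        misses += 1
print("false negative rate:", misses / runs)  # ~0.2 at this sample size
```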