Statistics in 40 Minutes: A/B Testing Fundamentals


Statistics in 40 Minutes: A/B Testing Fundamentals

Leo Pekelis, Statistician, Optimizely

@lpekelis · leonid@optimizely.com

#opticon2015

You have your own unique approach to A/B Testing

The goal of this talk is to break down A/B Testing to its fundamentals.

A/B Testing Platform

1) Create an experiment

2) Read the results page

“A/B Testing Playbook”

Opening: Hypotheses

Mid-game: Outcomes & Error Rates

Mid-game: Fundamental Tradeoff

Closing: Confidence Intervals


At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

The answers

1. A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.

2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low signal goals.

3. All three levers are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.

First, some vocabulary (yay!)

• Control and Variation: A control is the original, or baseline, version of content that you are testing against a variation.

• Goal: Metric used to measure the impact of the control and variation.

• Baseline conversion rate: The control group’s expected conversion rate.

• Effect size: The improvement (positive or negative) of your variation over baseline.

• Sample size: The number of visitors in your test.

• A hypothesis test is a control and variation pair that you want to show improves a goal.

• An experiment is a collection of hypotheses (goal & variation pairs) that all share the same control.
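To make the vocabulary concrete, here is a minimal simulation sketch (the numbers are illustrative assumptions, not from the talk): a control and a variation on one goal, with a chosen baseline conversion rate, effect size, and sample size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not numbers from the talk):
baseline_cr = 0.10     # baseline conversion rate of the control
effect_size = 0.05     # +5% relative improvement for the variation
sample_size = 20_000   # visitors in each arm of the test

# One hypothesis test: a control and a variation measured on a single goal.
control = rng.binomial(1, baseline_cr, sample_size)
variation = rng.binomial(1, baseline_cr * (1 + effect_size), sample_size)

print(f"control conversion rate:   {control.mean():.4f}")
print(f"variation conversion rate: {variation.mean():.4f}")
print(f"observed improvement: {variation.mean() / control.mean() - 1:+.2%}")
```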

“A/B Testing Playbook” Opening: Hypotheses

http://www.nba.com/

Imagine we are the NBA

What is a good hypothesis (test)?

Why is this not actionable?

“I think changing the header image will make my site better.”

• Removing the header will increase engagement?

• Removing the header will increase total revenue?

• Removing the header will increase “the finals” clicks?

• Growing the header will increase engagement?

• Growing the header will increase “the finals” clicks?

• …

Test creep! A vague, bad hypothesis spawns an open-ended list of questions.


Bad hypothesis: “I think changing the header image will make my site better.”

Good hypotheses: the specific questions above, each pairing one variation with a clearly defined goal.

Organized and clear

Hypotheses also give the cost of your experiment: the more relationships (hypotheses) you test, the longer (more visitors) it will take to achieve the same outcome (error rate).

Questions to check for a good hypothesis

What are you trying to show with your idea?

What key metrics should it drive?

Are all my goals and variations necessary given my testing limits?

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

Answer: A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

“A/B Testing Playbook” Mid-game: Outcomes & Error Rates


What are the possible outcomes?

The four possible outcomes form a 2x2 table:

                        “True” value of hypothesis
Result of test          Improvement        No effect
Winner / Loser          True positive      False positive
Inconclusive            False negative     True negative

• (no effect, winner / loser) = False positive

• (+/- improvement, inconclusive) = False negative

• (+/- improvement, winner / loser) = True positive

• (no effect, inconclusive) = True negative

The 2x2 table will help us to

1. Keep track of different error rates we care about

2. Explore the consequences of controlling false positives vs false discoveries

Error rate 1: False positive rate

• False positive rate (Type I error)

= “Chance of a false positive from a variation with no effect on a goal”

= #(False positives) / #(No effect)

In the 2x2 table, this is the (no effect, winner / loser) cell, taken as a fraction of all hypotheses with no true effect.

• Thresholding the FPR

= “When I have a variation with no effect on a goal, I’ll find an effect less than 10% of the time.”

How can we ever compute a False Positive Rate if we don’t know whether a hypothesis is true or not?

Statistical tests (fixed horizon t-test, Stats Engine) are designed to threshold an error rate.

Example: “Calling winners & losers when a p-value is below .05 will guarantee a False Positive Rate below 5%.”

False Positive Rates with multiple tests

https://xkcd.com/882/

What happened?

21 tests × 5% FPR ≈ 1 False Positive on average


False positive rates are only useful in the context of all hypotheses
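As an illustration of how false positives accumulate across many tests (a minimal simulation sketch, not part of the talk; sample sizes and rates are assumed), run 21 A/A comparisons, each with no true effect, at a 5% significance level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative assumptions: 21 goals/variations with NO true effect,
# each tested with a fixed-horizon test at a 5% significance level.
n_tests, n_visitors, baseline_cr, alpha = 21, 10_000, 0.10, 0.05

false_positives = 0
for _ in range(n_tests):
    control = rng.binomial(n_visitors, baseline_cr)
    variation = rng.binomial(n_visitors, baseline_cr)  # same rate: no effect
    # Two-sided test of equal conversion rates (chi-squared on the 2x2 counts).
    table = [[control, n_visitors - control],
             [variation, n_visitors - variation]]
    _, p_value, _, _ = stats.chi2_contingency(table)
    if p_value < alpha:
        false_positives += 1

print(f"{false_positives} 'winners/losers' found out of {n_tests} no-effect tests")
# Expected: about 21 * 0.05, roughly one false positive on average.
```

On average about one of the 21 no-effect tests comes back as a winner or loser, which is exactly the jelly bean trap from the comic.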

Error rate 2: False discovery rate

• False discovery rate (FDR)

= “Chance of a false positive from a conclusive result”

= #(False positives) / #(Winners & losers called)

In the 2x2 table, this is the (no effect, winner / loser) cell, taken as a fraction of all conclusive results.

• Thresholding the FDR

= “When you see a winning or losing goal on a variation, it’s wrong less than 10% of the time.”

1 conclusive result × 5% FDR = 0.05 False Positives on average

False discovery rates are useful despite the number of hypotheses
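Stats Engine controls the false discovery rate sequentially; as a simpler fixed-horizon illustration of the same idea (a sketch, not Optimizely's actual procedure), the Benjamini-Hochberg step-up rule thresholds FDR across a batch of p-values:

```python
import numpy as np

def benjamini_hochberg(p_values, fdr_threshold=0.05):
    """Return a boolean array: which hypotheses are called conclusive
    while keeping the expected false discovery rate below the threshold."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Largest k with p_(k) <= (k/m) * q; everything up to k is a discovery.
    below = ranked <= (np.arange(1, m + 1) / m) * fdr_threshold
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        discoveries[order[: k + 1]] = True
    return discoveries

# Example: 18 low signal (no-effect) goals with near-uniform p-values
# plus 3 high signal goals with very small p-values.
rng = np.random.default_rng(0)
p_values = np.concatenate([rng.uniform(size=18), [0.0005, 0.001, 0.004]])
print(benjamini_hochberg(p_values, fdr_threshold=0.05))
```

With mostly low signal (near-uniform) p-values, few or none are called conclusive; that is the runtime cost described above.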

What’s the catch?

The more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusives.

Not quite …

(hypotheses range from low signal to high signal)

The more low signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusives.

Recap

• False Positive Rate thresholding

-controls the chance of a false positive when you have a hypothesis with no effect

-misrepresents your error rate with multiple goals and variations

• False Discovery Rate thresholding

-controls the chance of a false positive when you have a winning or losing hypothesis

-is accurate regardless of how many hypotheses you run

-can take longer to reach significance with more low signal variations on goals

Tips & Tricks for running experiments with False Discovery Rates

• Ask: Which goal is most important to me?

-This should be my primary goal (not impacted by all other goals)

• Run large tests, or large multivariate tests, without fear of finding spurious results, but be prepared for the cost of exploration

• A little human intuition and prior knowledge can go a long way towards reducing the runtime of your experiments

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

“A/B Testing Playbook” Mid-game: Fundamental Tradeoff

“3 Levers” of A/B Testing

1. Threshold an error rate

• “I want no more than 10% false discovery rate”

2. Detecting effect sizes (setting an MDE)

• “I’m OK with only detecting greater than 5% improvement”

3. Running tests longer

• “I can afford to run this test for 3 weeks, or 50,000 visitors”

Fundamental Tradeoff of A/B Testing

Error rates, runtime, and effect size / baseline conversion rate are all inversely related:

• At any number of visitors, the less you threshold your error rate, the smaller effect sizes you can detect.

• At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.

• For any effect size, the lower error rate you want, the longer you need to run your test.

What does this look like in practice?

Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

                          Improvement (relative)
Significance Threshold    5%        10%       25%
95%                       62,400    13,500    1,800
90%                       59,100    12,800    1,700
80%                       52,600    11,400    1,500

All A/B Testing platforms address the fundamental tradeoff …

1. Choose a minimum detectable effect (MDE) and false positive rate threshold

2. Find the required minimum sample size with a sample size calculator

3. Wait until the minimum sample size is reached

4. Look at your results once and only once
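For step 2, a classical fixed-horizon sample size calculation looks roughly like the sketch below (the standard two-proportion power formula; the exact visitor counts will not match the Stats Engine table above, which uses sequential statistics):

```python
from scipy.stats import norm

def sample_size_per_arm(baseline_cr, relative_mde, alpha=0.05, power=0.8):
    """Classical fixed-horizon sample size for comparing two conversion rates."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided false positive threshold
    z_beta = norm.ppf(power)            # power = 1 - false negative rate
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# 10% baseline, 5% relative MDE, 5% false positive rate, 80% power.
print(sample_size_per_arm(0.10, 0.05))
```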

Optimizely is the only platform that lets you pull the levers in real time

In the beginning, we make an educated guess …

Error rate 5%; Effect size / Baseline CR +5% / 10%; Runtime 52,600 visitors.

… but then the improvement turns out to be better …

Error rate 5%; Effect size / Baseline CR +13% / 16%; Runtime (remaining) 1,600 visitors, instead of 52,600 - 7,200 = 45,400.

… or a lot worse.

Error rate 5%; Effect size / Baseline CR +2% / 8%; Runtime (remaining) > 100,000 visitors.

Recap

• The Fundamental Tradeoff of A/B Testing affects you no matter what testing platform you use.

-If you want to detect a 5% Improvement on a 10% baseline conversion rate, you should be prepared to wait for at least 50,000 visitors

• Optimizely’s Stats Engine is the only platform that allows you to adjust the trade-off in real time while still reporting valid error rates

At the end of this talk, you should be able to answer

1. What makes a good hypothesis?

2. What are the differences between False Positive Rate and False Discovery Rate?

3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

Answer: All three are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.


Statistics in 40 Minutes: A/B Testing Fundamentals

Leo Pekelis, Statistician, Optimizely

@lpekelis · leonid@optimizely.com

#opticon2015

“A/B Testing Playbook” Closing: Confidence Intervals

Definition:

A confidence interval is a range of values for your metric (revenue, conversion rate, etc.) that is 90%* likely to contain the true difference between your variation and baseline.

(Example interval from a results page: worst case 7.29, middle ground 11.4, best case 15.41.)

This is true regardless of your significance.


We can’t wait for significance

The confidence interval tells us what we need to know

A confidence interval is the mirror image of statistical significance

Mathematical Definition: A confidence interval is the set of parameter values X such that a hypothesis test with null hypothesis

H0: Removing a distracting header will result in X more revenue per visitor.

is not yet rejected.
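As a rough illustration of the definition (a sketch with assumed data, not the talk's numbers or Stats Engine's method), a normal-approximation confidence interval for the difference in revenue per visitor can be computed like this:

```python
import numpy as np
from scipy.stats import norm

def confidence_interval(control, variation, confidence=0.90):
    """Normal-approximation CI for the difference in means (variation - control)."""
    diff = variation.mean() - control.mean()
    se = np.sqrt(variation.var(ddof=1) / len(variation)
                 + control.var(ddof=1) / len(control))
    z = norm.ppf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se

# Illustrative revenue-per-visitor samples (assumed numbers, not from the talk).
rng = np.random.default_rng(1)
control = rng.exponential(scale=50.0, size=5_000)
variation = rng.exponential(scale=55.0, size=5_000)

low, high = confidence_interval(control, variation, confidence=0.90)
print(f"90% CI for the difference: [{low:.2f}, {high:.2f}] per visitor")
```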

Error rate 3: False negative rate

• False negative rate (Type II error)

= “Rate of false negatives from all hypotheses that could have been false negatives”

= “Rate of false negatives from all variations with an improvement on a goal”

= #(False negatives) / #(Improvement)

In the 2x2 table, this is the (improvement, inconclusive) cell, taken as a fraction of all hypotheses with a true improvement.

• Thresholding Type II error

= “When you have a goal on a variation with an effect, you miss it less than 10% of the time.”
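To see the false negative rate concretely, a small simulation sketch (assumed numbers, not from the talk) repeatedly runs an underpowered test against a real +5% relative lift and counts how often it comes back inconclusive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Assumed numbers for illustration: a real +5% relative lift on a 10% baseline.
baseline_cr, lift, n_visitors, alpha, n_sims = 0.10, 0.05, 10_000, 0.05, 2_000

misses = 0
for _ in range(n_sims):
    control = rng.binomial(n_visitors, baseline_cr)
    variation = rng.binomial(n_visitors, baseline_cr * (1 + lift))
    table = [[control, n_visitors - control],
             [variation, n_visitors - variation]]
    _, p_value, _, _ = stats.chi2_contingency(table)
    if p_value >= alpha:          # inconclusive despite a true improvement
        misses += 1

print(f"estimated false negative rate: {misses / n_sims:.0%} "
      f"at {n_visitors:,} visitors per arm")
```

At roughly 10,000 visitors per arm, well below the 50,000+ visitors the tradeoff table calls for, most of these real improvements are missed.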