Foundations of large-scale “doubly-sequential”...

Foundations of large-scale“doubly-sequential” experimentation

Aaditya Ramdas

Assistant ProfessorDept. of Statistics and Data Science Machine Learning Dept.Carnegie Mellon University

www.stat.cmu.edu/~aramdas/kdd19/

(KDD tutorial in Anchorage, on 4 Aug 2019)

A/B-testing : tech :: clinical trials : pharma

Large-scale A/B-testing or other relatedforms of randomized experimentation hasrevolutionized the tech industry in the last 15yrs.



In 2013, a team from Microsoft (Bing) claimed that they run tens of thousands of such experiments, leading to millions of dollars in increased revenue.

Kohavi et al. ’13



In 2013, a team from Microsoft (Bing) claimed that they run tens of thousands of such experiments, leading to millions of dollars in increased revenue.

Kohavi et al. ’13

Much has been discussed about doing A/B testing the “right” way,both theoretically and practically in real-world systems.Many companies contributing to this vast and growing literature.


Audience poll (for the speaker)

How many of you have written papers on A/B testing or online experimentation?(or work in the area, or consider yourselves experts?)



How many of you have read papers on A/B testingand know what it is, but want to know more?



How many of you have read papers on A/B testingand know what it is, but want to know more?


How many have no idea what I’m talking about?

Users of app or website

Sign up!

Sign up!……

50% 50%

A B


Sign up!

Sign up!……

50% 50%

44 conversions 71 conversions

A B


Sign up!

Sign up!……

50% 50%

44 conversions 71 conversionsB wins!

A B

What we will NOT cover today


Pre-experiment analysis and difference-in-differences


Pre-experiment analysis and difference-in-differencesWhat makes a good metric? (directionality and sensitivity)


Pre-experiment analysis and difference-in-differencesWhat makes a good metric? (directionality and sensitivity)Combining metrics into an overall evaluation criterion


Pre-experiment analysis and difference-in-differencesWhat makes a good metric? (directionality and sensitivity)Combining metrics into an overall evaluation criterionTime-series aspects: dealing with periodicity and trends


Pre-experiment analysis and difference-in-differencesWhat makes a good metric? (directionality and sensitivity)Combining metrics into an overall evaluation criterionTime-series aspects: dealing with periodicity and trendsCausal inference in observational studies



Parametric methods for A/B testing (like SPRT and variants)



Parametric methods for A/B testing (like SPRT and variants)Bayesian A/B testing



Parametric methods for A/B testing (like SPRT and variants)Bayesian A/B testingSide effects and risks associated with running experiments



Parametric methods for A/B testing (like SPRT and variants)Bayesian A/B testingSide effects and risks associated with running experimentsDeployment with controlled, phased rollouts




Pitfalls of long experiments (survivorship bias, perceived trends)




Pitfalls of long experiments (survivorship bias, perceived trends)ML meets causal inference meets online experiments




Pitfalls of long experiments (survivorship bias, perceived trends)ML meets causal inference meets online experimentsExperimentation in marketplaces or with network effects




Pitfalls of long experiments (survivorship bias, perceived trends)ML meets causal inference meets online experimentsExperimentation in marketplaces or with network effectsEthical aspects of running experiments

There are many resources for these topics

Yandex tutorial at The Web Conference ’18

Microsoft tutorial at The Web Conference ’19 (+ ExP Platform webpage)Blog posts by Evan Miller, Etsy, Optimizely, etc.

…

A new “doubly-sequential” perspective: a sequence of sequential experiments

Time / Samples

Exp.


Time / Samples

Exp.

A/B tests


Time / Samples

Exp.


Time / Samples

Exp.

common control

treatments


Time / Samples

Exp.


Zrnic, Ramdas, Jordan ‘18Yang, Ramdas, Jamieson, Wainwright ‘17

What kind of guarantees would we like for doubly sequential experimentation?


(a) inner sequential process (a single experiment) — correct inference when experiment ends (correct p-values for A/B test or correct confidence intervals for treatment effect)


(a) inner sequential process (a single experiment) — correct inference when experiment ends (correct p-values for A/B test or correct confidence intervals for treatment effect)

(b) outer sequential process (multiple experiments) — less clear (is error control on inner process enough?!)

Some potential issues within each experiment

Some potential issues across experiments

Some existing problems in practice

Many other concerns as well




(a) continuous monitoring (b) flexible experiment horizon (c) arbitrary stopping (or continuation) rules





(a) continuous monitoring (b) flexible experiment horizon (c) arbitrary stopping (or continuation) rules

(a) selection bias (multiplicity) (b) dependence across experiments (c) don’t know future outcomes


Solutions for these issues

Inner sequential process:

“confidence sequence” for estimation also called “anytime confidence intervals” (correspondingly, “always valid p-values” for testing)


Part I



Outer sequential process:

“false coverage rate” for estimation (correspondingly, “false discovery rate” for testing)


Part II

Part I






Part II

Part I

Modular solutions: fit well together Many extensions to each piece

Part III

The INNER Sequential Process(a single experiment)

[1 hour]

Part I

The “duality” between confidence intervals and p-values

Hypothesis testing is like stochastic proof by contradiction.


Null hypothesis: The coin is fair (bias = 0)

Alternative:Coin is biased towards H


1000 tosses

0 200 400 600 800

Tails Heads




1000 tosses

0 200 400 600 800

Tails Heads



Apparent contradiction!Should we reject the null hypothesis?

Calculate p-value

Possible observations

Calculate p-value

Possible observationsall

tailsall

heads

Calculate p-value


tailsall

heads

Prob.density

Calculate p-value


tailsall

heads

Prob.density

data

Calculate p-value


tailsall

heads

Prob.density

data

p-value P

Calculate p-value


tailsall

heads

Prob.density

data

p-value P

Calculate p-value

Reject null if P ≤ α


tailsall

heads

Prob.density

data

p-value P

Calculate p-value

Reject null if P ≤ α ≈ #H − #T ≥ 2N log(1/α) .


tailsall

heads

Prob.density

data

p-value P

Calculate p-value

Then, Pr(false positive) ≤ α .

Reject null if P ≤ α ≈ #H − #T ≥ 2N log(1/α) .

An equivalent view via confidence intervals


Estimate the coin bias by ̂μ :=#H − #T

N.


An asymptotic (1 − α)-CI for μ is given by ( ̂μ −z1−α

N, ̂μ +

z1−α

N ),


N.


(appealing to the Central Limit Theorem) where z1−α is the (1 − α)-quantile of N(0,1) .


N, ̂μ +

z1−α

N ),


N.


If this confidence interval does not contain 0,we may be reasonably confident that the coin is biased,and we may reject the null hypothesis.



N, ̂μ +

z1−α

N ),


N.


If this confidence interval does not contain 0,we may be reasonably confident that the coin is biased,and we may reject the null hypothesis.



N, ̂μ +

z1−α

N ),


N.

≈ #H − #T ≥ 2N log(1/α) .

For any parameter µ of interest,

with associated estimator bµ,the following claim holds:

bµ( )

R

| {z }(1�↵) confidence interval

for µ



bµ( )

R


µ0

for µ



bµ( )

R


µ0⌘

for µ



bµ( )

R


µ0⌘

For H0 : µ = µ0

for µ



bµ( )

R


µ0⌘

For H0 : µ = µ0

we have p-value Pµ0 ↵

for µ



bµ( )

R


µ0⌘

For H0 : µ = µ0


for µ

(we would reject the nullhypothesis at level ↵)



bµ( )

R


µ0⌘

For H0 : µ = µ0


for µ


(1�↵/2) confidence intervalz }| {

( )



bµ( )

R


µ0⌘

For H0 : µ = µ0


for µ


(1�↵/2) confidence intervalz }| {

( )

⌘ Pµ0 > ↵/2

In summary, tests (p-values) and CIs are “dual”.


family of tests for θ → CI for θ

CI for θ → family of tests for θ


A (1 − α)-CI for a parameter θ is the set of all θ0 such that the test for H0 : θ = θ0 has p-value larger than α .




A (1 − α)-CI for a parameter θ is the set of all θ0 such that

the smallest q for which the (1 − q)-CI for θ fails to cover θ0 .A p-value for testing the null H0 : θ = θ0 can be given by

the test for H0 : θ = θ0 has p-value larger than α .









CI for θ → composite tests for θ







the smallest q for which the (1 − q)-CI for θ fails to intersect Θ0 .A p-value for testing the null H0 : θ ∈ Θ0 can be given byCI for θ → composite tests for θ







the smallest q for which the (1 − q)-CI for θ fails to intersect Θ0 .A p-value for testing the null H0 : θ ∈ Θ0 can be given byCI for θ → composite tests for θ

Both of them are useful tools to estimate uncertainty,and like any other tool, they can be used well, or be misused.

However, commonly taught confidence intervalsand p-values are only valid (correctly control error)

if the sample size is fixed in advance.

Start

High-level caricature of an A/B-test

Collect more data (increase sample size)

Start


Collect more data (increase sample size) Check if P(n) ≤ α

“peek”Start



“peek”

Stop,Report

“optional stopping”

Start



“peek”

“optional continuation”Stop,

Report


Start



“peek”

“optional continuation”Stop,

Report


Start


With commonly-taught p-values, false positive rate ≫ α .

After 10 people

After 10 people

After 284 people

After 10 people

After 284 people

After 1214 people

After 10 people

After 284 people

After 1214 people

After 2398 people

After 10 people

After 284 people

After 1214 people

After 2398 people

After 7224 people

After 10 people

After 284 people

After 1214 people

After 2398 people

After 7224 people

After 11,219 people, STOP!

Let P(n) be a classical p-value (eg: t-test), calculated using the first n samples.


Under the null hypothesis (no treatment effect),

∀n ≥ 1, Pr(P(n) ≤ α)

prob. of false positive

≤ α .


Let τ be the stopping time of the experiment.


∀n ≥ 1, Pr(P(n) ≤ α)


≤ α .

Often, τ depends on data, eg: τ := min{n ∈ ℕ : Pn ≤ α} .




∀n ≥ 1, Pr(P(n) ≤ α)


≤ α .

Unfortunately, Pr(P(τ) ≤ α) ≰ α .

Often, τ depends on data, eg: τ := min{n ∈ ℕ : Pn ≤ α} .

In other words, Pr(∃n ∈ ℕ : P(n) ≤ α) ≫ α .

Collect more data (increase sample size) Check if 0 ∉ (1 − α) CI

“peek”

“optional continuation”

Stop


Start

Same problem with confidence interval (CI)

Again, false positive rate ≫ α .

Stop,Report

Let (L(n), U(n)) be any classical (1 − α) CI, calculated using the first n samples (eg: CLT).


When trying to estimate the treatment effect θ,

∀n ≥ 1, Pr(θ ∈ (L(n), U(n)))

prob. of coverage

≥ 1 − α .




∀n ≥ 1, Pr(θ ∈ (L(n), U(n)))

prob. of coverage

≥ 1 − α .

Unfortunately, Pr(θ ∈ (L(τ), U(τ))) ≱ 1 − α .

Again, τ may depend on data, eg: τ := min{n ∈ ℕ : L(n) > 0} .




∀n ≥ 1, Pr(θ ∈ (L(n), U(n)))

prob. of coverage

≥ 1 − α .

Unfortunately, Pr(θ ∈ (L(τ), U(τ))) ≱ 1 − α .

Again, τ may depend on data, eg: τ := min{n ∈ ℕ : L(n) > 0} .

In other words, Pr(∀n ≥ 1 : θ ∈ (L(n), U(n))) ≪ 1 − α .usually = 0.

Solution: “confidence sequence”(aka “anytime confidence intervals”)

or “sequential p-values” for testing(aka “always-valid p-values”)

A “confidence sequence” for a parameteris a sequence of confidence intervalswith a uniform (simultaneous) coverage guarantee.

✓<latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit><latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit><latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit><latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit>

(Ln, Un)<latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit><latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit><latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit><latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit>

Sample sizeℙ(∀n ≥ 1 : θ ∈ (Ln, Un)) ≥ 1 − α .

A “confidence sequence” for a parameteris a sequence of confidence intervalswith a uniform (simultaneous) coverage guarantee.

Darling, Robbins ’67, ‘68Lai ’76, ’84

Howard, Ramdas, McAuliffe, Sekhon ’18

✓<latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit><latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit><latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit><latexit sha1_base64="JqEnYvV6PtsKBJYmBVwEpjIMANw=">AAAB7XicbVDLSgNBEJyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOOsmY2ZllplcIS/7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqSw6PvfXmFtfWNzq7hd2tnd2z8oHx41rU4NhwbXUpt2xCxIoaCBAiW0EwMsjiS0ovHtzG89gbFCqwecJBDGbKjEQHCGTmp2cQTIeuWKX/XnoKskyEmF5Kj3yl/dvuZpDAq5ZNZ2Aj/BMGMGBZcwLXVTCwnjYzaEjqOKxWDDbH7tlJ45pU8H2rhSSOfq74mMxdZO4sh1xgxHdtmbif95nRQH12EmVJIiKL5YNEglRU1nr9O+MMBRThxh3Ah3K+UjZhhHF1DJhRAsv7xKmhfVwK8G95eV2k0eR5GckFNyTgJyRWrkjtRJg3DySJ7JK3nztPfivXsfi9aCl88ckz/wPn8Ao/ePKA==</latexit>

(Ln, Un)<latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit><latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit><latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit><latexit sha1_base64="PpnXScjDZFhd9U2EC1D0fZEkPzE=">AAAB8XicbVBNS8NAEJ3Ur1q/oh69LBahgpREBD0WvXjwUMG0xTaEzXbTLt1swu5GKKH/wosHRbz6b7z5b9y2OWjrg4HHezPMzAtTzpR2nG+rtLK6tr5R3qxsbe/s7tn7By2VZJJQjyQ8kZ0QK8qZoJ5mmtNOKimOQ07b4ehm6refqFQsEQ96nFI/xgPBIkawNtJj7S4QZ8gLxGlgV526MwNaJm5BqlCgGdhfvX5CspgKTThWqus6qfZzLDUjnE4qvUzRFJMRHtCuoQLHVPn57OIJOjFKH0WJNCU0mqm/J3IcKzWOQ9MZYz1Ui95U/M/rZjq68nMm0kxTQeaLoowjnaDp+6jPJCWajw3BRDJzKyJDLDHRJqSKCcFdfHmZtM7rrlN37y+qjesijjIcwTHUwIVLaMAtNMEDAgKe4RXeLGW9WO/Wx7y1ZBUzh/AH1ucP2MyPtg==</latexit>

Sample sizeℙ(∀n ≥ 1 : θ ∈ (Ln, Un)) ≥ 1 − α .

Example: tracking the mean of a Gaussianor Bernoulli from i.i.d. observations.

X1, X2, … ∼ N(θ,1) or Ber(θ)


X1, X2, … ∼ N(θ,1) or Ber(θ)

Producing a confidence interval at a fixed timeis elementary statistics (~100 years old).


X1, X2, … ∼ N(θ,1) or Ber(θ)

Producing a confidence interval at a fixed timeis elementary statistics (~100 years old).

How do we produce a confidence sequence?(which is like a confidence band over time)

Empirical mean

−1.0

−0.5

0.0

0.5

1.0

101 102 103 104 105

Number of samples, t

Con

fiden

ce b

ounds

0.0

0.2

0.4

0.6

101 102 103 104 105


Cum

ula

tive

mis

cove

rage

pro

b.

Pointwise CLT Pointwise Hoeffding Linear boundary Curved boundary

(Fair coin)

Empirical mean

−1.0

−0.5

0.0

0.5

1.0

101 102 103 104 105


Con

fiden

ce b

ounds

0.0

0.2

0.4

0.6

101 102 103 104 105


Cum

ula

tive

mis

cove

rage

pro

b.

Pointwise CLT Pointwise Hoeffding Linear boundary Curved boundary

Empirical mean

−1.0

−0.5

0.0

0.5

1.0

101 102 103 104 105


Con

fiden

ce b

ounds

0.0

0.2

0.4

0.6

101 102 103 104 105


Cum

ula

tive

mis

cove

rage

pro

b.

Pointwise CLT Pointwise Hoeffding Linear boundary Curved boundaryAnytime CIPointwise CI (CLT)

Eg:∑n

i=1 Xi

n± 1.71

log log(2n) + 0.72 log(5.19/α)n

If Xi is 1-subGaussian, then

is a (1 − α) confidence sequence.

Hoeffding boundCLT bound

Jamieson et al. (2013)Balsubramani (2014)Zhao et al. (2016)Darling & Robbins (1967b)Kaufmann et al. (2014)Normal mixtureDarling & Robbins (1968)Polynomial stitching (ours)Inverted stitching (ours)Discrete mixture (ours)

0

2,000

0 105

Vt

Bou

ndar

y

Eg:∑n

i=1 Xi

n± 1.71

log log(2n) + 0.72 log(5.19/α)n



ℙ( ⋃n∈ℕ

{θ ∉ (Ln, Un)}) ≤ α .

Some implications:

ℙ( ⋃n∈ℕ

{θ ∉ (Ln, Un)}) ≤ α .

Some implications:

1. Valid inference at any time, even stopping times:

ℙ( ⋃n∈ℕ

{θ ∉ (Ln, Un)}) ≤ α .

Some implications:


ℙ( ⋃n∈ℕ

{θ ∉ (Ln, Un)}) ≤ α .

For any stopping time τ : ℙ(θ ∉ (Lτ, Uτ)) ≤ α .

Some implications:


2. Valid post-hoc inference (in hindsight):

ℙ( ⋃n∈ℕ

{θ ∉ (Ln, Un)}) ≤ α .


Some implications:



ℙ( ⋃n∈ℕ

{θ ∉ (Ln, Un)}) ≤ α .


For any random time T : ℙ(θ ∉ (LT, UT)) ≤ α .

Some implications:



3. No pre-specified sample size: can extend or stop experiments adaptively.

ℙ( ⋃n∈ℕ

{θ ∉ (Ln, Un)}) ≤ α .


For any random time T : ℙ(θ ∉ (LT, UT)) ≤ α .

The same duality between confidence intervals and p-valuesalso holds in the sequential setting:“confidence sequences” are dual to“always valid p-values”.

Duality between anytime p-value and CI

Define a set of null values ℋ0 for θ .



Let P(n) := inf{α : the (1 − α) CI(n) does not intersect ℋ0}




If CI(n) is a pointwise CI then P(n) is a classical p-value .

For all fixed times n, Pr(P(n) ≤ α)


≤ α .




If CI(n) is a pointwise CI then P(n) is a classical p-value .

If CI(n) is an anytime CI then P(n) is an always-valid p-value .

For all fixed times n, Pr(P(n) ≤ α)


≤ α .

For all stopping times τ, Pr(P(τ) ≤ α) ≤ α .


For all data-dependent times T, Pr(P(T) ≤ α) ≤ α .

Relationship to Sequential Probability Ratio Test

we want to test a null hypothesis H0 : θ = θ0

against an alternative hypothesis H1 : θ = θ1 .

Given a stream of data X1, X2, … ∼ fθ, suppose

Wald ‘48





Wald's SPRT (or SLRT) calculates a probability/likelihood ratio:

L(n) :=∏n

i=1 f1(Xi)

∏ni=1 f0(Xi)

,

and rejects when L(n) > 1/α . Can also use prior/mixture over θ1 .

Wald ‘48






L(n) :=∏n

i=1 f1(Xi)

∏ni=1 f0(Xi)

,


Equivalently, define P(n) = 1/L(n) . Then P(n) is an always-valid p-value.

Wald ‘48






L(n) :=∏n

i=1 f1(Xi)

∏ni=1 f0(Xi)

,


Equivalently, define P(n) = 1/L(n) . Then P(n) is an always-valid p-value.

(And inverting it defines a confidence sequence.) Wald ‘48

Can construct confidence sequences(and hence always valid p-values)in a wide variety of nonparametric settings(eg: random variables that arebounded, or subGaussian, or subexponential)





Part I






Part II

Part I






Part II

Part I


Part III

The OUTER Sequential Process(a sequence of experiments)

[40 mins]

Part I1

Quick recap of A/B testing

A :


A :

B :


Null hypothesis: A is at least

as good as B.

A :

B :


0 200 400 600 800

Misses Clicks


as good as B.

A :

B :


0 200 400 600 800

Misses Clicks


as good as B.

A :

B :

P = Pr(observed data or moreextreme, assuming null is true)

Calculate p-value:


0 200 400 600 800

Misses Clicks


as good as B.

A :

B :


Decision rule : if , then we reject the null

(“discovery”). We change A to B,

ensuring that type-1 error .

Calculate p-value:

P ↵

P ↵


0 200 400 600 800

Misses Clicks


as good as B.

A :

B :


Decision rule : if , then we reject the null

(“discovery”). We change A to B,

ensuring that type-1 error .

Calculate p-value:

a wrong rejection of the null

is a false discovery and implies

a bad change from A to B.

P ↵

P ↵

Reality: internet companies run thousands of different (independent) A/B tests over time.


Time


Time

vs. Color


Time

vs. ColorDecision rule:


Time


P1 ↵?


Time

vs.

vs.

Color

Size

Decision rule:P1 ↵?


Time

vs.

vs.

Color

Size


P2 ↵?


Time

vs.

vs.

vs.

Color

Size

Orientation


P2 ↵?


Time

vs.

vs.

vs.

Color

Size

Orientation


P2 ↵?

P3 ↵?


Time

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style


P2 ↵?

P3 ↵?


Time

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style


P2 ↵?

P3 ↵?

P4 ↵?


Time

vs.

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style

Logo


P2 ↵?

P3 ↵?

P4 ↵?


Time

vs.

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style

Logo


P2 ↵?

P3 ↵?

P4 ↵?

P5 ↵?


Time

vs.

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style

Logo

Decision rule:

Problem!

P1 ↵?

P2 ↵?

P3 ↵?

P4 ↵?

P5 ↵?

Run 10,000 different,

independentA/B tests



9,900 true nulls

100 non-nulls



9,900 true nulls

100 non-nulls

type-1error rate (per test)= 0.05



9,900 true nulls

100 non-nulls


495 false discoveries



9,900 true nulls

100 non-nulls



power (per test)= 0.80



9,900 true nulls

100 non-nulls



80 true discoveries




9,900 true nulls

100 non-nulls



80 true discoveries

false discovery proportion

FDP = 495/575


FDP = # false discoveries# discoveries



9,900 true nulls

100 non-nulls



80 true discoveries


FDP = 495/575



FDR = E[FDP]



9,900 true nulls

100 non-nulls



80 true discoveries


FDP = 495/575


Summary: FDR can be larger than per-test error rate.(even if hypotheses, tests, data are independent)


FDR = E[FDP]

Given a possibly infinite sequence of independent tests (p-values), can we guarantee control of the FDR in a fully online fashion?

Foster-Stine ’08Aharoni-Rosset ’14

Javanmard-Montanari ’16Ramdas-Yang-Wainwright-Jordan ’17Ramdas-Zrnic-Wainwright-Jordan ’18

Tian-Ramdas ’19

The aim of online FDR procedures

Time

Decision rule:


Time



Time


P1 ↵1?


Time

vs.

vs.

Color

Size

Decision rule:P1 ↵1?


Time

vs.

vs.

Color

Size


P2 ↵2?


Time

vs.

vs.

vs.

Color

Size

Orientation


P2 ↵2?


Time

vs.

vs.

vs.

Color

Size

Orientation


P2 ↵2?

P3 ↵3?


Time

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style


P2 ↵2?

P3 ↵3?


Time

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style


P2 ↵2?

P3 ↵3?

P4 ↵4?


Time

vs.

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style

Logo


P2 ↵2?

P3 ↵3?

P4 ↵4?


Time

vs.

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style

Logo


P2 ↵2?

P3 ↵3?

P4 ↵4?

P5 ↵5?


Time

vs.

vs.

vs.

vs.

vs.

Color

Size

Orientation

Style

Logo

Decision rule:

How do weset each

error level to control FDR at any time?

P1 ↵1?

P2 ↵2?

P3 ↵3?

P4 ↵4?

P5 ↵5?

Offline FDR methodsdo not control the FDRin online settings

Benjamini-Hochberg ’95

One of the most famous offline FDR methods is the “Benjamini-Hochberg” (BH) method

The following method is not a valid online FDR algorithm:

At the end of experiment t, run BH on P1, …, Pt .



The reason is that the decision about the first hypothesis depends on all future hypotheses. We cannot commit to a decision and stick to it.



The reason is that the decision about the first hypothesis depends on all future hypotheses. We cannot commit to a decision and stick to it.

We need the error level αt for experiment tto be specified when it starts, and we need to make a final decision when experiment t ends.

This multiple testing issue is not particular to p-values. It also exists when selectively reporting treatment effectswith confidence intervals.

Benjamini, Yekutieli ’05Weinstein, Ramdas ’19

Multiplicity in reported CIs

One rarely cares about all CIs or follows-up on them,one usually reports only the most “promising” CIs.

False coverage proportion


FCP =# incorrectly reported CIs

# reported CIs


False coverage proportion


FCP =# incorrectly reported CIs

# reported CIs

FCR = 𝔼[FCP]

False coverage rate


Benjamini-Yekutieli ’06Weinstein-Yekutieli ’14

Fithian et al. ’14

Controlling FCR is nontrivial

Constructing marginal 95% CIs for all parametersfails to control FCR at 0.05.

<latexit sha1_base64="FMP/KogEvHFO2tjJge5o3+g3UBs=">AAACP3icbVA9TyMxFPTyfeErHCXNEwkSVbSLhIAOEekEHSASkLJR9NbxBguvvbLfIkUR/4zm/sJ1tDQUnE60dOcNKfh61WjmPXtmklxJR2H4EExNz8zOzS/8qCwuLa+sVtd+tp0pLBctbpSxVwk6oaQWLZKkxFVuBWaJEpfJTbPUL2+FddLoCxrmopvhQMtUciRP9artptGObMFJ6gFkaAdSo4L6wW68VYfmiYPUWEClIEeLmSD/VhxDJUWpHJABbjRZo+BX8xyQoB42wt16o1eteTAe+AqiCaixyZz2qn/ivuFFJjRxhc51ojCn7ggtSa7EXSUunMiR3+BAdDzU3orrjsb572DLM/2x0dS7gTH7/mKEmXPDLPGbGdK1+6yV5Hdap6B0vzuSOi9IaP72UVqoMndZJvSlFZzU0APkVnqvwK99T7ysqeJLiD5H/graO40obERnO7XDo0kdC2yDbbJtFrE9dsiO2SlrMc7u2SN7Zn+D38FT8C94eVudCiY36+zDBK//AfwqrFs=</latexit><latexit sha1_base64="FMP/KogEvHFO2tjJge5o3+g3UBs=">AAACP3icbVA9TyMxFPTyfeErHCXNEwkSVbSLhIAOEekEHSASkLJR9NbxBguvvbLfIkUR/4zm/sJ1tDQUnE60dOcNKfh61WjmPXtmklxJR2H4EExNz8zOzS/8qCwuLa+sVtd+tp0pLBctbpSxVwk6oaQWLZKkxFVuBWaJEpfJTbPUL2+FddLoCxrmopvhQMtUciRP9artptGObMFJ6gFkaAdSo4L6wW68VYfmiYPUWEClIEeLmSD/VhxDJUWpHJABbjRZo+BX8xyQoB42wt16o1eteTAe+AqiCaixyZz2qn/ivuFFJjRxhc51ojCn7ggtSa7EXSUunMiR3+BAdDzU3orrjsb572DLM/2x0dS7gTH7/mKEmXPDLPGbGdK1+6yV5Hdap6B0vzuSOi9IaP72UVqoMndZJvSlFZzU0APkVnqvwK99T7ysqeJLiD5H/graO40obERnO7XDo0kdC2yDbbJtFrE9dsiO2SlrMc7u2SN7Zn+D38FT8C94eVudCiY36+zDBK//AfwqrFs=</latexit><latexit sha1_base64="FMP/KogEvHFO2tjJge5o3+g3UBs=">AAACP3icbVA9TyMxFPTyfeErHCXNEwkSVbSLhIAOEekEHSASkLJR9NbxBguvvbLfIkUR/4zm/sJ1tDQUnE60dOcNKfh61WjmPXtmklxJR2H4EExNz8zOzS/8qCwuLa+sVtd+tp0pLBctbpSxVwk6oaQWLZKkxFVuBWaJEpfJTbPUL2+FddLoCxrmopvhQMtUciRP9artptGObMFJ6gFkaAdSo4L6wW68VYfmiYPUWEClIEeLmSD/VhxDJUWpHJABbjRZo+BX8xyQoB42wt16o1eteTAe+AqiCaixyZz2qn/ivuFFJjRxhc51ojCn7ggtSa7EXSUunMiR3+BAdDzU3orrjsb572DLM/2x0dS7gTH7/mKEmXPDLPGbGdK1+6yV5Hdap6B0vzuSOi9IaP72UVqoMndZJvSlFZzU0APkVnqvwK99T7ysqeJLiD5H/graO40obERnO7XDo0kdC2yDbbJtFrE9dsiO2SlrMc7u2SN7Zn+D38FT8C94eVudCiY36+zDBK//AfwqrFs=</latexit><latexit sha1_base64="FMP/KogEvHFO2tjJge5o3+g3UBs=">AAACP3icbVA9TyMxFPTyfeErHCXNEwkSVbSLhIAOEekEHSASkLJR9NbxBguvvbLfIkUR/4zm/sJ1tDQUnE60dOcNKfh61WjmPXtmklxJR2H4EExNz8zOzS/8qCwuLa+sVtd+tp0pLBctbpSxVwk6oaQWLZKkxFVuBWaJEpfJTbPUL2+FddLoCxrmopvhQMtUciRP9artptGObMFJ6gFkaAdSo4L6wW68VYfmiYPUWEClIEeLmSD/VhxDJUWpHJABbjRZo+BX8xyQoB42wt16o1eteTAe+AqiCaixyZz2qn/ivuFFJjRxhc51ojCn7ggtSa7EXSUunMiR3+BAdDzU3orrjsb572DLM/2x0dS7gTH7/mKEmXPDLPGbGdK1+6yV5Hdap6B0vzuSOi9IaP72UVqoMndZJvSlFZzU0APkVnqvwK99T7ysqeJLiD5H/graO40obERnO7XDo0kdC2yDbbJtFrE9dsiO2SlrMc7u2SN7Zn+D38FT8C94eVudCiY36+zDBK//AfwqrFs=</latexit>


Suppose treatment e↵ect ✓j 2 {±0.1} for all j,and experimental observations are normalized toXj ⇠ N(✓j , 1).

<latexit sha1_base64="gfM/xJEjMu7AD/AFwnxRplluGo0=">AAACeHicbVHLbtNAFB2bVwmPBlh2c0Vc0UpVZGdDlxVsWKEiSBspE1nj8XUz6TysmXHVYOUb+Dd2fAgbVozdIEHLXR2dO0dnzrlFLYXzafojiu/df/Dw0c7jwZOnz57vDl+8PHOmsRyn3EhjZwVzKIXGqRde4qy2yFQh8by4fN/tz6/QOmH0F7+ucaHYhRaV4MwHKh9++9zUtXEIPqi8Qu0Bqwq5h4T6JXqWr6jQtKW1gnSc0U0ClbHApIRklRxRCgOmS8DrGq3o5EyCKRzaq97AAbMI2ljFpPiKJXgDlA4gmeUroE4o+Hjwx+coO0zG+XCUjtN+4C7ItmBEtnOaD7/T0vCms+aSOTfP0tovWma94BI3A9o4rBm/ZBc4D1AzhW7R9sVtYD8wZR+oMiF5z/6taJlybq2K8FIxv3S3dx35v9288dXxohW6bjxqfmNUNbKL310BSmFDx3IdAONWhL8CXzLLuA+3GoQSstuR74KzyTgLJ/k0GZ2829axQ/bIa3JAMvKWnJAP5JRMCSc/o70oifajXzHEb+LDm6dxtNW8Iv9MPPkNyNS8yQ==</latexit><latexit sha1_base64="gfM/xJEjMu7AD/AFwnxRplluGo0=">AAACeHicbVHLbtNAFB2bVwmPBlh2c0Vc0UpVZGdDlxVsWKEiSBspE1nj8XUz6TysmXHVYOUb+Dd2fAgbVozdIEHLXR2dO0dnzrlFLYXzafojiu/df/Dw0c7jwZOnz57vDl+8PHOmsRyn3EhjZwVzKIXGqRde4qy2yFQh8by4fN/tz6/QOmH0F7+ucaHYhRaV4MwHKh9++9zUtXEIPqi8Qu0Bqwq5h4T6JXqWr6jQtKW1gnSc0U0ClbHApIRklRxRCgOmS8DrGq3o5EyCKRzaq97AAbMI2ljFpPiKJXgDlA4gmeUroE4o+Hjwx+coO0zG+XCUjtN+4C7ItmBEtnOaD7/T0vCms+aSOTfP0tovWma94BI3A9o4rBm/ZBc4D1AzhW7R9sVtYD8wZR+oMiF5z/6taJlybq2K8FIxv3S3dx35v9288dXxohW6bjxqfmNUNbKL310BSmFDx3IdAONWhL8CXzLLuA+3GoQSstuR74KzyTgLJ/k0GZ2829axQ/bIa3JAMvKWnJAP5JRMCSc/o70oifajXzHEb+LDm6dxtNW8Iv9MPPkNyNS8yQ==</latexit><latexit sha1_base64="gfM/xJEjMu7AD/AFwnxRplluGo0=">AAACeHicbVHLbtNAFB2bVwmPBlh2c0Vc0UpVZGdDlxVsWKEiSBspE1nj8XUz6TysmXHVYOUb+Dd2fAgbVozdIEHLXR2dO0dnzrlFLYXzafojiu/df/Dw0c7jwZOnz57vDl+8PHOmsRyn3EhjZwVzKIXGqRde4qy2yFQh8by4fN/tz6/QOmH0F7+ucaHYhRaV4MwHKh9++9zUtXEIPqi8Qu0Bqwq5h4T6JXqWr6jQtKW1gnSc0U0ClbHApIRklRxRCgOmS8DrGq3o5EyCKRzaq97AAbMI2ljFpPiKJXgDlA4gmeUroE4o+Hjwx+coO0zG+XCUjtN+4C7ItmBEtnOaD7/T0vCms+aSOTfP0tovWma94BI3A9o4rBm/ZBc4D1AzhW7R9sVtYD8wZR+oMiF5z/6taJlybq2K8FIxv3S3dx35v9288dXxohW6bjxqfmNUNbKL310BSmFDx3IdAONWhL8CXzLLuA+3GoQSstuR74KzyTgLJ/k0GZ2829axQ/bIa3JAMvKWnJAP5JRMCSc/o70oifajXzHEb+LDm6dxtNW8Iv9MPPkNyNS8yQ==</latexit><latexit sha1_base64="gfM/xJEjMu7AD/AFwnxRplluGo0=">AAACeHicbVHLbtNAFB2bVwmPBlh2c0Vc0UpVZGdDlxVsWKEiSBspE1nj8XUz6TysmXHVYOUb+Dd2fAgbVozdIEHLXR2dO0dnzrlFLYXzafojiu/df/Dw0c7jwZOnz57vDl+8PHOmsRyn3EhjZwVzKIXGqRde4qy2yFQh8by4fN/tz6/QOmH0F7+ucaHYhRaV4MwHKh9++9zUtXEIPqi8Qu0Bqwq5h4T6JXqWr6jQtKW1gnSc0U0ClbHApIRklRxRCgOmS8DrGq3o5EyCKRzaq97AAbMI2ljFpPiKJXgDlA4gmeUroE4o+Hjwx+coO0zG+XCUjtN+4C7ItmBEtnOaD7/T0vCms+aSOTfP0tovWma94BI3A9o4rBm/ZBc4D1AzhW7R9sVtYD8wZR+oMiF5z/6taJlybq2K8FIxv3S3dx35v9288dXxohW6bjxqfmNUNbKL310BSmFDx3IdAONWhL8CXzLLuA+3GoQSstuR74KzyTgLJ/k0GZ2829axQ/bIa3JAMvKWnJAP5JRMCSc/o70oifajXzHEb+LDm6dxtNW8Iv9MPPkNyNS8yQ==</latexit>




Suppose we only care about drugs with large e↵ects.So we only pursue phase II of the trial if Xj > 3.

<latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit><latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit><latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit><latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit>






Suppose we only care about drugs with large e↵ects.So we only pursue phase II of the trial if Xj > 3.

<latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit><latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit><latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit><latexit sha1_base64="eT1OBoZu+q2ozcPHjJOqWpC8g1E=">AAACVXicbVFNbxMxFPQupZTw0QBHLk+kSJxWu+0BTqgqF3orKmkjJVH01vs2a+pdW/Zzqyjqn+wF8U+4IOGkEYKWkSyNZt7Iz+PSauU5z38k6YOth9uPdh73njx99ny3/+LlmTfBSRpKo40blehJq46GrFjTyDrCttR0Xl58Wvnnl+S8Mt1XXliatjjvVK0kcpRmfX0arDWe4IrAdHoBEh0BliYwVC7MPVwpbkCjmxNQXZNkn00mvVPzJ2GD84HANnENOD4GUwM3BOwUalA17I1m3+AjHOxls/4gz/I14D4pNmQgNjiZ9W8mlZGhpY6lRu/HRW55ukTHSmq67k2CJ4vyAuc0jrTDlvx0uW7lGt5GpYLauHg6hrX6d2KJrfeLtoyTLXLj73or8X/eOHD9YbpUnQ1Mnby9qA4a2MCqYqiUizXFZiqF0qm4K8gGHUqOH9GLJRR3n3yfnO1nRZ4VX/YHh0ebOnbEa/FGvBOFeC8OxWdxIoZCihvxM0mSNPme/Eq30u3b0TTZZF6Jf5Du/gY/R7EV</latexit>

For these drugs, the standard marginal 95% CIdoes not cover ✓j . So FCR=1.

<latexit sha1_base64="MNZd6d4WqE/DvhH93VkMRXIatwE=">AAACQHicbVA9bxNBEN0LX8F8GShpRtiRKNDpLhJKKJCiWIqgCx+OI/ksa25vzt5kb/e0OxfJsvLT0uQn0FHTUIAQLRV7iQtIeNXbN/NmZ15ea+U5Sb5Eazdu3rp9Z/1u5979Bw8fdR8/OfC2cZKG0mrrDnP0pJWhISvWdFg7wirXNMqPB219dELOK2s+8aKmSYUzo0olkYM07Y72rAOekycoXDPzL9sHeEZToCugQjdTBjX0X7/KNvoweJdlncKSB2MZpA2joZ8FC+P0qB/DRwt7gw9v0nja7SVxcgG4TtIV6YkV9qfdz1lhZVORYanR+3Ga1DxZomMlNZ12ssZTjfIYZzQO1GBFfrK8COAUNoJSQBlOKa1p9wrq344lVt4vqjx0Vshzf7XWiv+rjRsutydLZeqGycjLj8pGA1to04RCOZKsF4GgdCrsCnKODiWHzDshhPTqydfJwWacJnH6frO3s7uKY108E8/FC5GKLbEj3op9MRRSnImv4rv4EZ1H36Kf0a/L1rVo5Xkq/kH0+w9bcayH</latexit><latexit sha1_base64="MNZd6d4WqE/DvhH93VkMRXIatwE=">AAACQHicbVA9bxNBEN0LX8F8GShpRtiRKNDpLhJKKJCiWIqgCx+OI/ksa25vzt5kb/e0OxfJsvLT0uQn0FHTUIAQLRV7iQtIeNXbN/NmZ15ea+U5Sb5Eazdu3rp9Z/1u5979Bw8fdR8/OfC2cZKG0mrrDnP0pJWhISvWdFg7wirXNMqPB219dELOK2s+8aKmSYUzo0olkYM07Y72rAOekycoXDPzL9sHeEZToCugQjdTBjX0X7/KNvoweJdlncKSB2MZpA2joZ8FC+P0qB/DRwt7gw9v0nja7SVxcgG4TtIV6YkV9qfdz1lhZVORYanR+3Ga1DxZomMlNZ12ssZTjfIYZzQO1GBFfrK8COAUNoJSQBlOKa1p9wrq344lVt4vqjx0Vshzf7XWiv+rjRsutydLZeqGycjLj8pGA1to04RCOZKsF4GgdCrsCnKODiWHzDshhPTqydfJwWacJnH6frO3s7uKY108E8/FC5GKLbEj3op9MRRSnImv4rv4EZ1H36Kf0a/L1rVo5Xkq/kH0+w9bcayH</latexit><latexit sha1_base64="MNZd6d4WqE/DvhH93VkMRXIatwE=">AAACQHicbVA9bxNBEN0LX8F8GShpRtiRKNDpLhJKKJCiWIqgCx+OI/ksa25vzt5kb/e0OxfJsvLT0uQn0FHTUIAQLRV7iQtIeNXbN/NmZ15ea+U5Sb5Eazdu3rp9Z/1u5979Bw8fdR8/OfC2cZKG0mrrDnP0pJWhISvWdFg7wirXNMqPB219dELOK2s+8aKmSYUzo0olkYM07Y72rAOekycoXDPzL9sHeEZToCugQjdTBjX0X7/KNvoweJdlncKSB2MZpA2joZ8FC+P0qB/DRwt7gw9v0nja7SVxcgG4TtIV6YkV9qfdz1lhZVORYanR+3Ga1DxZomMlNZ12ssZTjfIYZzQO1GBFfrK8COAUNoJSQBlOKa1p9wrq344lVt4vqjx0Vshzf7XWiv+rjRsutydLZeqGycjLj8pGA1to04RCOZKsF4GgdCrsCnKODiWHzDshhPTqydfJwWacJnH6frO3s7uKY108E8/FC5GKLbEj3op9MRRSnImv4rv4EZ1H36Kf0a/L1rVo5Xkq/kH0+w9bcayH</latexit><latexit sha1_base64="MNZd6d4WqE/DvhH93VkMRXIatwE=">AAACQHicbVA9bxNBEN0LX8F8GShpRtiRKNDpLhJKKJCiWIqgCx+OI/ksa25vzt5kb/e0OxfJsvLT0uQn0FHTUIAQLRV7iQtIeNXbN/NmZ15ea+U5Sb5Eazdu3rp9Z/1u5979Bw8fdR8/OfC2cZKG0mrrDnP0pJWhISvWdFg7wirXNMqPB219dELOK2s+8aKmSYUzo0olkYM07Y72rAOekycoXDPzL9sHeEZToCugQjdTBjX0X7/KNvoweJdlncKSB2MZpA2joZ8FC+P0qB/DRwt7gw9v0nja7SVxcgG4TtIV6YkV9qfdz1lhZVORYanR+3Ga1DxZomMlNZ12ssZTjfIYZzQO1GBFfrK8COAUNoJSQBlOKa1p9wrq344lVt4vqjx0Vshzf7XWiv+rjRsutydLZeqGycjLj8pGA1to04RCOZKsF4GgdCrsCnKODiWHzDshhPTqydfJwWacJnH6frO3s7uKY108E8/FC5GKLbEj3op9MRRSnImv4rv4EZ1H36Kf0a/L1rVo5Xkq/kH0+w9bcayH</latexit>





Can we control FCR *online*?

When experiment j starts, we must assigna target confidence level ↵j .When experiment j ends, we must decideif we wish to report ✓j .This must be done such that the FCRis controlled at any time.

<latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit><latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit><latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit><latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit>

Can we control FCR *online*?

When experiment j starts, we must assigna target confidence level ↵j .When experiment j ends, we must decideif we wish to report ✓j .This must be done such that the FCRis controlled at any time.

<latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit><latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit><latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit><latexit sha1_base64="5F3c2hZSIZVBX2jI9lT37fkUWZU=">AAAC33icbVJNb9NAEF2brxK+Ahy5jGiQOKDI7gWOFZUQx4KaplIcRev1ON52vWvtjluiKBcuHECIK3+LG3+EM+MmSKXtSLae570Zz77ZvDE6UJL8juIbN2/dvrN1t3fv/oOHj/qPnxwG13qFI+WM80e5DGi0xRFpMnjUeJR1bnCcn+x1/PgUfdDOHtCiwWkt51aXWkni1Kz/J7NO2wItwbhCC/ipQa/r7ntwPIBA0lN4BWcIdRsIZAh6brMMehKYmiOBcrbU3EAhGDxFA4NMmqaSs+PBMMt613VFW1zoWaDiepbqssud6VABOfDYOM/6jCqkf90OKh3WVTlC4SxCaBXLK0n8Qni397EbjkU8FnlnDBbAXIZ1Uy2lXayAeI7hrL+dDJPzgKsg3YBtsYn9Wf9XVjjVdkdQhl2YpElD0yW7o5XBVS9rAzZSncg5ThhaWWOYLs/3s4IXnCmgdJ4f2znG2YsVS1mHsKhzVtaSqnCZ65LXcZOWyjfTpbZNS7yA9Y/K1nT2dcuGQntUZBYMpPKaZwVVSS8V8ZXosQnp5SNfBYc7wzQZph92tnffbuzYEs/Ec/FSpOK12BXvxb4YCRVl0efoa/QtlvGX+Hv8Yy2No03NU/FfxD//Aijn46o=</latexit>

Weinstein, Ramdas ‘19

A simple solution for both online FDR control, andonline FCR control

Online FCR control: the main idea

Maintain [FCP(T ) :=PT

i=1 ↵i

1 _PT

i=1 Si

↵.<latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit><latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit><latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit><latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit>

Let Si 2 {0, 1} denote the selection decisionmade after experiment i.

<latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit><latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit><latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit><latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit>


Weinstein & Ramdas ’19


i=1 ↵i

1 _PT

i=1 Si







i=1 ↵i

1 _PT

i=1 Si




For testing, let Ri ∈ {0,1} represent a rejection.

Then maintain ̂FDP (T ) :=∑T

i=1 αi

1 ∨ ∑Ti=1 Ri

≤ α .




i=1 ↵i

1 _PT

i=1 Si




For testing, let Ri ∈ {0,1} represent a rejection.

Then maintain ̂FDP (T ) :=∑T

i=1 αi

1 ∨ ∑Ti=1 Ri

≤ α .

This provably controls FCR/FDR at level α .

Online FCR control : high-level picture

Remaining error budget or “alpha-wealth”



Error budget for first expt.

α1




Error budget for second expt.

α1

α2





Expts. use wealth

α1

α2

α3





Expts. use wealthSelections

earn wealth

α1

α2

α3

α4






earn wealth

Error budgetis data-dependent

α1

α2

α3

α4

α5






earn wealth

Error budgetis data-dependent

Infinite process

α1

α2

α3

α4

α5

α6

Summary of this section


even if hypotheses, data, p-values are independent.8t, errort ↵ does not imply 8t, FDR(t) ↵,


even if hypotheses, data, p-values are independent.

Can track a running estimate of the FDP (or FCP):a simple update rule to keep this estimate boundedalso results in the FDR (or FCR) being controlled.

8t, errort ↵ does not imply 8t, FDR(t) ↵,

Handling local dependence

Most online FDR algorithms assume independent p-values(but hypotheses can be dependent).



However, assuming arbitrary dependence between all p-valuesis also extremely pessimistic and unrealistic.




A middle ground is a flexible notion of local dependence:Pt arbitrarily depends on the previous Lt p-values,where Lt is a user-chosen lag-parameter .


Zrnic, Ramdas, Jordan ’18



A middle ground is a flexible notion of local dependence:Pt arbitrarily depends on the previous Lt p-values,where Lt is a user-chosen lag-parameter .

The online FDR and FCR algorithms can be easily modifiedto handle local dependence.




Part I






Part II

Part I






Part II

Part I


Part III

Putting the modular pieces together: the doubly-sequential process

[Next 10 mins]

Putting the modular pieces together: the doubly-sequential process

[Next 10 mins]

Part III

Combining inner and outer solutions (FCR):

α1

α2

α3

αi

α4

α5


α1

α2

α3

αi

α4

α5

(a) Online FCR method assigns αi when expt. starts


α1

α2

α3

αi

α4

α5

(a) Online FCR method assigns αi when expt. starts(b) We keep track of (1 − αi) confidence sequence


α1

α2

α3

αi

α4

α5

(a) Online FCR method assigns αi when expt. starts(b) We keep track of (1 − αi) confidence sequence(c) Adaptively decide to stop, to report final CI or not


α1

α2

α3

αi

α4

α5

(a) Online FCR method assigns αi when expt. starts(b) We keep track of (1 − αi) confidence sequence(c) Adaptively decide to stop, to report final CI or not(d) Guarantee FCR(T) ≤ α at any time positive time T

Time / Samples

Exp.


α1

α2

α3

αi

α4

α5

(a) Online FCR method assigns αi when expt. starts(b) We keep track of (1 − αi) confidence sequence(c) Adaptively decide to stop, to report final CI or not(d) Guarantee FCR(T) ≤ α at any time positive time T

Time / Samples

Exp.

Combining inner and outer solutions (FDR):

α1

α2

α3

αi

α4

α5

Time / Samples

Exp.


α1

α2

α3

αi

α4

α5

(a) Online FDR method assigns αi when expt. starts

Time / Samples

Exp.


α1

α2

α3

αi

α4

α5

(a) Online FDR method assigns αi when expt. starts(b) We keep track of anytime p-value P(n)

i

Time / Samples

Exp.


α1

α2

α3

αi

α4

α5


i(c) Adaptively stop at time τ, report discovery if P(τ)

i ≤ αi

Time / Samples

Exp.


α1

α2

α3

αi

α4

α5


i(c) Adaptively stop at time τ, report discovery if P(τ)

i ≤ αi(d) Guarantee FDR(T) ≤ α at any time positive time T

PART IV: Advanced topics (inner sequential process)

[Next 25 mins]

Much more traffic needed by an A/B/n test

1. What if we are testing more than one alternative?

1. Multi-armed bandits for hypothesis testing

What would you do?


What would you do?Depends on the aim: minimize regret OR identify best arm?We would like to test null hypothesis

H0 : μA ≥ max{μB, μC} .



Can design variant of UCB algorithms to define anytime p-value, with optimal sample complexity for high power.



Yang, Ramdas, Jamieson, Wainwright ’17


Can design variant of UCB algorithms to define anytime p-value, with optimal sample complexity for high power.


MAB-FDR meta algorithm

𝛼𝛼𝑗𝑗 𝑅𝑅𝑗𝑗 (𝛼𝛼𝑗𝑗)

Exp j

MAB Test𝑝𝑝𝑗𝑗 < 𝛼𝛼𝑗𝑗

𝑝𝑝𝑗𝑗 (𝛼𝛼𝑗𝑗)

𝛼𝛼j+1 𝑅𝑅j+1 (𝛼𝛼j+1)

Exp j+1

MAB Test𝑝𝑝j+1 < 𝛼𝛼j+1

𝑝𝑝j+1 (𝛼𝛼j+1)

Online FDR procedure

……

desired FDR level 𝛼𝛼

2. Switch from estimating means to quantiles?

Let X ∼ F . Define the α-quantile as qα := sup{x : F(x) ≤ α} .

(hence q1/2 is the median)


Howard, Ramdas ‘19







Reasons to use quantiles include:





Reasons to use quantiles include:Quantiles always exist for any distribution, while means (moments) do not always exist (eg: Cauchy).





Reasons to use quantiles include:Quantiles always exist for any distribution, while means (moments) do not always exist (eg: Cauchy).Quantiles can be defined for any totally ordered space,eg: ratings A-F, where “distance between ratings” undefined.





Reasons to use quantiles include:Quantiles always exist for any distribution, while means (moments) do not always exist (eg: Cauchy).Quantiles can be defined for any totally ordered space,eg: ratings A-F, where “distance between ratings” undefined.Estimating quantiles can be done sequentially, withoutany tail assumptions, unlike estimating means.





Reasons to use quantiles include:Quantiles always exist for any distribution, while means (moments) do not always exist (eg: Cauchy).Quantiles can be defined for any totally ordered space,eg: ratings A-F, where “distance between ratings” undefined.Estimating quantiles can be done sequentially, withoutany tail assumptions, unlike estimating means.Can run A/B tests and get always valid p-valuesfor testing the difference in quantiles.





Reasons to use quantiles include:Quantiles always exist for any distribution, while means (moments) do not always exist (eg: Cauchy).Quantiles can be defined for any totally ordered space,eg: ratings A-F, where “distance between ratings” undefined.Estimating quantiles can be done sequentially, withoutany tail assumptions, unlike estimating means.Can run A/B tests and get always valid p-valuesfor testing the difference in quantiles.Can run bandit experiments, including best-arm identification.





Reasons to use quantiles include:Quantiles always exist for any distribution, while means (moments) do not always exist (eg: Cauchy).Quantiles can be defined for any totally ordered space,eg: ratings A-F, where “distance between ratings” undefined.Estimating quantiles can be done sequentially, withoutany tail assumptions, unlike estimating means.Can run A/B tests and get always valid p-valuesfor testing the difference in quantiles.Can run bandit experiments, including best-arm identification.

Can also estimate all quantiles simultaneously!

2. Quantiles are informative for heavy tails

Possible observations

Prob.density

q0 q1/2 q0.8

Mean = + ∞

Eg: amount of time spent on Reddit

(heavy right tail)

2. The mean need not even exist (eg: Cauchy)

Prob.density

q0.3 q1/2 q0.8

Mean is undefined.

Eg: amount of money won/lost in a casino

(heavy right tail)(heavy left tail)

2. The mean need not even exist (eg: Cauchy)

Prob.density

q0.3 q1/2 q0.8

Mean is undefined.

Eg: amount of money won/lost in a casino

Do not need to resort to trimming “outliers”.(How to pick threshold? Throw away or cap?)

(heavy right tail)(heavy left tail)

2. The same could arise in discrete settings

Prob.mass

q0.8q1/2

Mean = + ∞

Eg: number of links clicked

…

pk ∝ 1/k2 ⟹

2. Quantile sensible in totally ordered settings

Prob.mass

q0.8q1/2

Mean is undefined.

Eg: grades or non-numerical ratings

…A < B < C < D < E < …

2. Quantile sensible in totally ordered settings

Prob.mass

q0.8q1/2

Mean is undefined.

Eg: grades or non-numerical ratings

…A < B < C < D < E < …

Do not need to artificially assign numerical values.(Are they equally spaced? Spacing and start point matter.)

2. A/B testing with quantiles

H0 : q0.9(A) = q0.9(B)

First pick target quantile α (say 0.9).

H1 : q0.9(A) < q0.9(B)


H0 : q0.9(A) = q0.9(B)


H1 : q0.9(A) < q0.9(B)



H0 : q0.9(A) = q0.9(B)


H1 : q0.9(A) < q0.9(B)


Can construct always valid p-value.


H0 : q0.9(A) = q0.9(B)


H1 : q0.9(A) < q0.9(B)


If numerical, can construct confidence sequence for q0.9(B) − q0.9(A) .



H0 : q0.9(A) = q0.9(B)


H1 : q0.9(A) < q0.9(B)


If numerical, can construct confidence sequence for q0.9(B) − q0.9(A) .


(In that case, one way to define sequential p-value is the smallest δ such that the (1 − δ) CS overlaps with ℝ−

0 .)

2. Best-arm identification with quantiles


…

Which arm has the highest 80% quantile?



…


Can design MAB algorithms to adaptively determinethe “best” arm with a prescribed failure probability.



…


Can design MAB algorithms to adaptively determinethe “best” arm with a prescribed failure probability.

If the first arm is “special”, can design MAB algorithms to adaptivelytest the null hypothesis that A is best, and get a sequential p-value.

3. Running intersections or minimums: pros/cons

Fact 1: if P(n) is an anytime p-value, so is minm≤n

P(m) .

Fact 2: if (L(n), U(n)) is a confidence sequence, so is ⋂m≤n

(L(n), U(n)) .




P(m) .


(L(n), U(n)) .




P(m) .


(L(n), U(n)) .

Pro of taking running intersections of CIs : Smaller width, hence tighter inference, without inflating error.




P(m) .


(L(n), U(n)) .


Con of taking running intersections of CIs : Can have intervals of decreasing width (great!) and then in thenext step, end up with an empty interval (disconcerting).




P(m) .


(L(n), U(n)) .


Con of taking running intersections of CIs : Can have intervals of decreasing width (great!) and then in thenext step, end up with an empty interval (disconcerting).

Pro of ending up with zero width : “Failing loudly”: you know you’re in the low-probability errorevent, or assumptions have been violated.

Can change with time!(eg: keep groups balanced)

4. Sequential Average Treatment Effect estimation with adaptive randomization


Sign up!

Sign up!……

50% 50%

A B

Can change with time!(eg: keep groups balanced)


Can infer the treatment effect sequentially (Neyman-Rubin potential outcomes model) using anytime p-value or CI.

4. Sequential Average Treatment Effect estimation with adaptive randomization


Sign up!

Sign up!……

50% 50%

A B

PART V: Advanced topics (outer sequential process)

[Next 15 mins]

Recent tests may be more relevant than older ones, and hence we may wish to smoothly forget the past.

1. Smoothly forgetting the past


With this motivation, we may wish to control the

decaying memory FDR: (user-chosen decay d < 1)





mem-FDR(T ) = EP

t dT�t1(false discoveryt)Pt d

T�t1(discoveryt)

�

Ramdas et al. ’17





mem-FDR(T ) = EP

t dT�t1(false discoveryt)Pt d

T�t1(discoveryt)

�

Ramdas et al. ’17


(similarly mem-FCR)

2. Post-hoc analysis


Katsevich, Ramdas ’18



What if you did not use an online FCR or FDR algorithm,but at the end of the year, you would like to answer “based on the decisions made and error levels used, how large could my FCR or FDR be?”




FDPt ≤1 + ∑i≤t αi

∑i≤t Ri⋅

log(1/δ)log(1 + log(1/δ))

simultaneously for all t .

With probability at least 1 − δ we have




FDPt ≤1 + ∑i≤t αi

∑i≤t Ri⋅



With probability at least 1 − δ we have

FCPt ≤1 + ∑i≤t αi

∑i≤t Si⋅



3. Weighted error metrics and algorithms


But, different experiments may have differing importances.The usual error metrics count all mistakes equally.


Benjamini, Hochberg ’97Ramdas, Yang, Jordan, Wainwright ’17


Can define “weighted” variants of FDR and FCR in the natural way: weighted sums in numerator/denominator.


Benjamini, Hochberg ’97Ramdas, Yang, Jordan, Wainwright ’17


Can define “weighted” variants of FDR and FCR in the natural way: weighted sums in numerator/denominator.

Online FDR and FCR algorithms can be extended to control weighted error metrics.

4. False-sign rate

4. False-sign rate

Sometimes, all we want is a “sign decision” about parameter:an output of +1 if treatment effect is +ve,and an output of -1 if treatment effect is -ve,or no output at all if it is uncertain.

4. False-sign rate


We may correspondingly define the false sign rate as

FSR := 𝔼 [ # incorrect sign decisions made# sign decisions made ]

4. False-sign rate

Weinstein, Ramdas ’19


We may correspondingly define the false sign rate as

FSR := 𝔼 [ # incorrect sign decisions made# sign decisions made ]

To control the FSR, just using the online FCR algorithm, and report the sign iff the CI does not contain zero.

Open Problems [5 mins]

1. Errors and incentives in large organizations

A large number of different teams run such A/B tests or randomized experiments

From the larger organization’s perspective, coordination is necessary to control FDR or FCR,since that might affect the bottom line of the company.




But each individual group or team might feel“why do we have to pay if some other groupis running lots of random tests/experiments”?




But each individual group or team might feel“why do we have to pay if some other groupis running lots of random tests/experiments”?

How do we align incentives? Should our notion of error be hierarchical?

1. A hierarchical FDR or FCR control?

Company

. . .

Product 1(Group 1)

Product 2(Group 2)

Product 15(Group 15)

desires FDR ≤ 0.1

The average of group FDRs does not give company FDR.

FDR is additive in the worst case: if each group separately controls FDR at 0.1, the company FDR could be trivial.

2. Utilizing contextual information

Often, we have contextual information abouteach visitor (sample), like age, gender, etc.These have been utilized for contextualbandit algorithms that minimize regret.

2. Utilizing contextual information

Often, we have contextual information abouteach visitor (sample), like age, gender, etc.These have been utilized for contextualbandit algorithms that minimize regret.

Is such information useful for hypothesis testing?How do we use contextual bandits for hypothesis testing?

3. Designing systems that fail loudly

When our assumptions are wrong, and the system is not behaving like intended or expected, howcan we automatically detect and report this?

3. Designing systems that fail loudly

When our assumptions are wrong, and the system is not behaving like intended or expected, howcan we automatically detect and report this?

Is it possible to design such self-critical systemsthat “announce” failures?

Summary [15 mins]

A selective history (inner process)

Time


Fisher (1925) null hypothesis testing basicsrandomization for causal inference

Time


Fisher (1925) null hypothesis testing basicsrandomization for causal inferenceWald (1948)

sequential probability ratio test(the first always-valid p-values)

Time




Robbins (1952) multi-armed bandits

Time




Darling & Robbins (1967) confidence sequences

(the first always valid CIs)


Time







Lai, Siegmund,… (1970s) confidence sequences, inference after stopping experiments

Time






Jennison & Turnbull (1980s) group sequential methods

(peeking only 2 or 3 times)


Lai, Siegmund,… (1970s) confidence sequences, inference after stopping experiments

Time

A selective history (outer process)

Time


Tukey (1953) an unpublished book on the problem of multiple comparisons

Time



Eklund & Seeger (1963) define false discovery proportion

suggested heuristic algorithm

Time




suggested heuristic algorithmBenjamini & Hochberg (1995) rediscovered Eklund-Seeger methodfirst proof of FDR control

Time





Benjamini & Yekutieli (2005) false coverage rate (FCR)first methods to control it

Benjamini & Hochberg (1995) rediscovered Eklund-Seeger methodfirst proof of FDR control

Time





Benjamini & Yekutieli (2005) false coverage rate (FCR)first methods to control it

Benjamini & Hochberg (1995) rediscovered Eklund-Seeger methodfirst proof of FDR control

Foster & Stine (2008) conceptualized online FDR control first method to control it

Time

In this tutorial, you learnt the basics of


• How to think about a single experimentA. Why peeking is an issue in practiceB. Why applying a t-test repeatedly inflates errorsC. Anytime confidence intervals and p-values



• How to think about a sequence of experiments A. Why selective reporting is an issue in practiceB. Why Benjamini-Hochberg fails in the online settingC. Online FCR and FDR controlling algorithms



• How to think about a sequence of experiments A. Why selective reporting is an issue in practiceB. Why Benjamini-Hochberg fails in the online settingC. Online FCR and FDR controlling algorithms

• How to think about doubly-sequential experimentationA. Using anytime CIs with online FCR controlB. Using anytime p-values with online FDR controlC. Handling asynchronous tests with local dependence

You also learnt some advanced topics:

You also learnt some advanced topics:• Within a single experiment:

A. Using bandits for hypothesis testingB. Quantiles can be estimated sequentiallyC. The pros and cons of running intersectionsD. SATE with adaptive randomization



• Across experiments: A. Error metrics with decaying-memoryB. The false sign rateC. Weighted error metricsD. Post-hoc analysis



• Across experiments: A. Error metrics with decaying-memoryB. The false sign rateC. Weighted error metricsD. Post-hoc analysis

• Open problems: A. Incentives/errors within hierarchical organizationsB. Utilizing contextual information for testingC. Designing systems that fail loudly

SOFTWARE

• Within a single experiment:Python package called “confseq”Maintained by Steve Howard (Berkeley)Frequent updates + wrappers for months to come

• Across experiments:R package called “onlineFDR”Maintained by David Robertson (Cambridge)Frequent updates + wrappers for months to come

References and links atwww.stat.cmu.edu/~aramdas/kdd19/

http://www.stat.cmu.edu/~aramdas/kdd19/

Collaborators from this talk

Martin Wainwright

Michael Jordan

TijanaZrnic

Steve Howard

JasjeetSekhon

Jon McAuliffe

Asaf Weinstein

Kevin Jamieson

Eugene Katsevich

Jinjin Tian

Akshay Balsubramani

DavidRobertson

Fanny Yang

Foundations of large-scale“doubly-sequential” experimentation

Aaditya Ramdas

Assistant ProfessorDept. of Statistics and Data Science Machine Learning Dept.Carnegie Mellon University

www.stat.cmu.edu/~aramdas/kdd19/

(KDD tutorial in Anchorage, on 4 Aug 2019)

Thank you! Questions?Funding welcomed!

Foundations of large-scale “doubly-sequential”...

Documents

Transcript of Foundations of large-scale “doubly-sequential”...