An Evaluation Framework for Personalization Strategy Experiment Designs
C. H. Bryan Liu ([email protected])
Imperial College London & ASOS.com, UK
Emma J. McCoy
Imperial College London, UK
ABSTRACT
Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control in group assignment. We argue that getting the right experiment setup (the allocation of users to treatment/analysis groups) should take precedence over post-hoc variance reduction techniques in order to enable the scaling of the number of experiments. We present an evaluation framework that, along with a few simple rules of thumb, allows experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstances.
KEYWORDS
Experimentation, Experiment design, Personalization strategies, A/B testing.
1 INTRODUCTION
The use of Online Controlled Experiments (OCEs, e.g. A/B tests) has become popular in measuring the impact of digital products and services, and in guiding business decisions on the Web [10]. Major companies report running thousands of OCEs on any given day [6, 8, 14], and many startups exist purely to manage OCEs [2, 7].
A large number of OCEs address simple variations on elements of the user experience based on random splits, e.g. showing a differently colored button to users based on a user ID hash bucket. Here, we are interested in experiments that compare personalization strategies, complex sets of targeted user interactions that are common in e-commerce and digital marketing, and measure the change to a performance metric of interest (metric hereafter). Examples of personalization strategies include the scheduling, budgeting, and ordering of marketing activities directed at a user based on their purchase history.
Experiments for personalization strategies face two unique challenges. Firstly, strategies are often only applicable to a small fraction of the user base, and thus many simple experiment designs suffer from either a lack of sample size/statistical power, or metric movement diluted by including irrelevant samples [3]. Secondly, as users are not randomly assigned a priori, but must qualify to be treated with a strategy via their actions or attributes, groups of users subjected to different strategies cannot be assumed to be statistically equivalent and hence are not directly comparable.
While there are a number of variance reduction techniques (including stratification and control variates [4, 12]) that partially address these challenges, the strata and control variates involved can vary dramatically from one personalization strategy experiment to another, requiring many ad hoc adjustments. As a result, such techniques may not scale well when organizations design and run hundreds or thousands of experiments at any given time.
We argue that personalization strategy experiments should focus on the assignment of users from the strategies they qualify for to the treatment/analysis groups. We call this mapping process an experiment setup. Identifying the best experiment setup increases the chance of detecting any treatment effect. An experimentation framework can also reuse and switch between different setups quickly with little custom input, ensuring the operation can scale. More importantly, the process does not hinder the subsequent application of variance reduction techniques, meaning that we can still apply those techniques post hoc if required.
To date, many experiment setups exist to compare personalization strategies. An increasingly popular approach is to compare the strategies using multiple control groups: Quantcast calls it a dual control [1], and Facebook calls it a multi-cell lift study [9]. In the two-strategy case, this involves running two experiments on two random partitions of the user base in parallel, with each experiment further splitting its partition into treatment/control groups and measuring the incrementality (the change in a metric compared to the case where nothing is done) of each strategy. The incrementalities of the strategies are then compared against each other.
Despite the setup above gaining traction in display advertising, there is a lack of literature on whether it is a better setup, i.e. one that has a higher sensitivity and/or apparent effect size than other setups. While [9] noted that multi-cell lift studies require a large number of users, they did not discuss how the number compares to other setups.¹ The ability to identify and adopt a better experiment setup can reduce the required sample size, and hence enable more cost-effective experimentation.
We address this gap in the literature by introducing an evaluation framework that compares experiment setups given two personalization strategies. The framework is designed to be flexible so that it can deal with the wide range of baselines and changes in user responses presented by any pair of strategies (situations hereafter). We also recognize the need to quickly compare common setups, and provide some simple rules of thumb on situations where one setup will be better than another. In particular, we outline the conditions where employing a multi-cell setup, as well as metric dilution, is desirable.
To summarize, our contributions are: (i) we develop a flexible evaluation framework for personalization strategy experiments, where one can compare two experiment setups given the situation presented by two competing strategies (Section 2); (ii) we provide simple rules of thumb to enable experimenters who do not require the full flexibility of the framework to quickly compare common
¹ A single-cell lift study is often used to measure the incrementality of a single personalization strategy, and hence is not a representative comparison.
arXiv:2007.11638v2 [stat.ME] 17 Dec 2020
AdKDD '20, Aug 2020, Online (Liu and McCoy)
Figure 1: Venn diagram of the user groups in our evaluation framework. The outer, left inner (red), and right inner (blue) boxes represent the entire user base, those who qualify for strategy 1, and those who qualify for strategy 2 respectively. Group 0 (qualify for neither strategy) lies outside both inner boxes; groups 1, 2, and 3 label the regions covered by strategy 1 only, strategy 2 only, and both strategies.
setups (Section 3); and (iii) we make our results useful to practitioners by making the code used in the paper (Section 4) publicly available.²
2 EVALUATION FRAMEWORK
We first present our evaluation framework for personalization strategy experiments. The experiments compare two personalization strategies, which we refer to as strategy 1 and strategy 2. Often one of them is the existing strategy, and the other is a new strategy we intend to test and learn from. In this section we introduce (i) how users qualifying themselves into strategies creates non-statistically-equivalent groups, (ii) how experimenters usually assign the users, and (iii) when we would consider an assignment to be better.
2.1 User grouping
As users qualify themselves into the two strategies, four disjoint groups emerge: those who qualify for neither strategy, those who qualify only for strategy 1, those who qualify only for strategy 2, and those who qualify for both strategies. We denote these (user) groups 0, 1, 2, and 3 respectively (see Figure 1). It is perhaps obvious that we cannot assume those in different user groups are statistically equivalent and compare them directly.
We assume groups 0, 1, 2, and 3 have $n_0$, $n_1$, $n_2$, and $n_3$ users respectively. We also assume responses from users (which we aggregate, often by taking the mean, to obtain our metric) are distributed differently between groups, and, within the same group, between the scenario where the group is subjected to the treatment associated with the corresponding strategy and the scenario where nothing is done (baseline). We list all group-scenario combinations in Table 1, and denote the mean and variance of the responses $(\mu_G, \sigma^2_G)$ for a combination $G$.³
2.2 Experiment setups
Many experiment setups exist and are in use in different organizations. Here we introduce four common setups of varying sophistication, which we also illustrate in Figure 2.
² The code and supplementary document are available at: https://github.com/liuchbryan/experiment_design_evaluation.
³ For example, the responses for group 1 without any intervention have mean and variance $(\mu_{C_1}, \sigma^2_{C_1})$, and those for group 2 with the treatment prescribed under strategy 2 have mean and variance $(\mu_{I_2}, \sigma^2_{I_2})$.
Figure 2: Experiment setups overlaid on the user grouping Venn diagram in Figure 1. The hatched boxes indicate who are included in the analysis, and the downward triangles and dots indicate who are subjected to treatment prescribed under strategies 1 and 2 respectively. Setups 1-3 each split users into analysis groups A and B; Setup 4 splits them into analysis groups A1/A2 and B1/B2, and compares A = A2 - A1 against B = B2 - B1. See Section 2.2 for a detailed description.
Setup 1 (Users in the intersection only). The setup considers only users who qualify for both strategies. These users are randomly split (usually 50/50) into two (analysis) groups $A$ and $B$, which are prescribed the treatment specified by strategies 1 and 2 respectively. The setup is easy to implement, though it is difficult to translate any learnings obtained from the setup to other user groups (e.g. those who qualify for one strategy only) [5].
Setup 2 (All samples). The setup is a simple A/B test that considers all users, regardless of whether they qualify for any strategy. The users are randomly split into two analysis groups $A$ and $B$, and are prescribed the treatment specified by strategy 1 (2) if (i) they qualify under the strategy and (ii) they are in analysis group $A$ ($B$). This setup is the easiest to implement but usually suffers severely from metric dilution [3].
Setup 3 (Qualified users only). The setup is similar to Setup 2 except that only those who qualify for at least one strategy ("triggered" users in some literature [3]) are included in the analysis groups. The setup sits between Setups 1 and 2 in terms of user coverage, and has the advantage of capturing the largest number of useful samples while incurring the least metric dilution. However, the setup also prevents one from telling the incrementality of a strategy itself; it only yields the difference in incrementalities between the two strategies.
Setup 4 (Dual control / multi-cell lift test). As described in Section 1, the setup first splits the users randomly into two randomization groups. For the first randomization group, we consider those who qualify for strategy 1, and split them into analysis groups $A_1$ and $A_2$. Group $A_2$ receives the treatment prescribed under strategy 1, and group $A_1$ acts as control. The incrementality for strategy 1 is then the difference in metric between groups $A_2$ and $A_1$.⁷ We
                                  Group 0   Group 1   Group 2   Group 3
Baseline (Control)                $C_0$     $C_1$     $C_2$     $C_3$
Under treatment (Intervention)    /         $I_1$     $I_2$     Under strategy 1: $I_{3_1}$; Under strategy 2: $I_{3_2}$

Table 1: All group-scenario combinations in our evaluation framework for personalization strategy experiments. The columns represent the groups described in Figure 1. The baseline represents the scenario where nothing is done. We assume those who qualify for both strategies (Group 3) can only receive treatment(s) associated with either strategy.
apply the same process to the second randomization group, with strategy 2 and analysis groups $B_1$ and $B_2$ in place, and compare the incrementalities of strategies 1 and 2. The setup allows one to obtain the incrementality of each individual strategy and minimizes metric dilution. However, it also leaves a number of samples unused and creates extra analysis groups, and hence generally suffers from low test power [9].
2.3 Evaluation criteria
There are a number of considerations when one evaluates competing experiment setups. These include technical considerations, such as the complexity of implementing a setup on an OCE framework, and business considerations, such as whether the incrementality of individual strategies is required.
Here we focus on the statistical aspect and propose two evaluation criteria: (i) the actual average treatment effect size as presented by the two analysis groups in an experiment setup, and (ii) the sensitivity of the experiment, represented by the minimum detectable effect (MDE) under a pre-specified test power. Both criteria are necessary, as the former indicates whether a setup suffers from metric dilution, whereas the latter indicates whether the setup suffers from a lack of power/sample size. An ideal setup should yield a high actual effect size and a high sensitivity (i.e. a low MDE),⁴ though as we observe in the next section there is usually a trade-off.
We formally define the two evaluation criteria from first principles, introducing the relevant notation along the way. Let $A$ and $B$ be the two analysis groups in an experiment setup, with user responses randomly distributed with mean and variance $(\mu_A, \sigma^2_A)$ and $(\mu_B, \sigma^2_B)$ respectively. We first recall that if there are sufficient samples, the sample means of the two groups approximately follow a normal distribution by the Central Limit Theorem:

$$\bar{A} \overset{\text{approx.}}{\sim} \mathcal{N}(\mu_A,\, \sigma^2_A/n_A), \qquad \bar{B} \overset{\text{approx.}}{\sim} \mathcal{N}(\mu_B,\, \sigma^2_B/n_B), \quad (1)$$

where $n_A$ and $n_B$ are the number of samples taken from $A$ and $B$ respectively. The difference in the sample means then also approximately follows a normal distribution:

$$\hat{D} \triangleq \bar{B} - \bar{A} \overset{\text{approx.}}{\sim} \mathcal{N}\big(\Delta \triangleq \mu_B - \mu_A,\;\; \sigma^2_{\hat{D}} \triangleq \sigma^2_A/n_A + \sigma^2_B/n_B\big). \quad (2)$$
Here, $\Delta$ is the actual effect size that we are interested in.

The definition of the MDE $\theta^*$ requires a primer on the power of a statistical test. A common null hypothesis statistical test in personalization strategy experiments uses the two-tailed hypotheses $H_0: \Delta = 0$ and $H_1: \Delta \neq 0$, with the test statistic under $H_0$ being:

$$Z \triangleq \hat{D}/\sigma_{\hat{D}} \overset{\text{approx.}}{\sim} \mathcal{N}(0, 1). \quad (3)$$

⁴ We will use the terms "high(er) sensitivity" and "low(er) MDE" interchangeably.

We recall the null hypothesis will be rejected if $|Z| > z_{1-\alpha/2}$, where $\alpha$ is the significance level and $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of a standard normal. Under a specific alternative hypothesis $\Delta = \theta$, the power is specified as

$$1 - \beta_\theta \triangleq \Pr\big(|Z| > z_{1-\alpha/2} \,\big|\, \Delta = \theta\big) \approx 1 - \Phi\big(z_{1-\alpha/2} - |\theta|/\sigma_{\hat{D}}\big), \quad (4)$$

where $\Phi$ denotes the cumulative distribution function of a standard normal.⁵ To achieve a minimum test power $\pi_{\min}$, we require that $1 - \beta_\theta > \pi_{\min}$. Substituting Equation (4) into the inequality and rearranging to make $\theta$ the subject yields the effect sizes that the test will be able to detect with the specified power:

$$|\theta| > (z_{1-\alpha/2} - z_{1-\pi_{\min}})\,\sigma_{\hat{D}}. \quad (5)$$

$\theta^*$ is then defined as the smallest positive $\theta$ that satisfies Inequality (5), i.e. that specified by the RHS of the inequality.
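As a concrete illustration, the MDE in Inequality (5) can be computed directly from the group variances and sample sizes. Below is a minimal sketch using only Python's standard library; the function name is ours, not from the paper's codebase:

```python
from statistics import NormalDist

def minimum_detectable_effect(var_a, var_b, n_a, n_b, alpha=0.05, power=0.8):
    """theta* = (z_{1-alpha/2} - z_{1-power}) * sigma_D-hat, per Inequality (5)."""
    z_tilde = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    sigma_d_hat = (var_a / n_a + var_b / n_b) ** 0.5  # Equation (2)
    return z_tilde * sigma_d_hat
```

For example, two analysis groups of 1,000 users each with unit response variance yield a $\theta^*$ of roughly 0.125 under a 5% significance level and 80% power.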
We finally define what it means to be better under these evaluation criteria. WLOG we assume the actual effect sizes of the two competing experiment setups are positive,⁶ and say a setup $X$ is superior to another setup $Y$ if, all else being equal,

(i) $X$ produces a higher actual effect size ($\Delta_X > \Delta_Y$) and a lower minimum detectable effect size ($\theta^*_X < \theta^*_Y$); or
(ii) the gain in actual effect is greater than the loss in sensitivity:

$$\Delta_X - \Delta_Y > \theta^*_X - \theta^*_Y, \quad (6)$$

which means an actual effect still stands a higher chance of being observed under $X$.
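The two criteria fold into a small helper; note that criterion (i) in fact implies criterion (ii), since it gives $\Delta_X - \Delta_Y > 0 > \theta^*_X - \theta^*_Y$. A sketch (function name is ours):

```python
def is_superior(delta_x, theta_x, delta_y, theta_y):
    """Return True if setup X is superior to setup Y per Section 2.3."""
    # Criterion (i): higher actual effect AND lower MDE.
    if delta_x > delta_y and theta_x < theta_y:
        return True
    # Criterion (ii): gain in actual effect exceeds the loss in sensitivity.
    return delta_x - delta_y > theta_x - theta_y
```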
3 COMPARING EXPERIMENT SETUPS
Having described the evaluation framework above, in this section we use the framework to compare the common experiment setups described in Section 2.2. We first derive the actual effect size and MDE for each setup in Section 3.1, and then use the results to create rules of thumb on (i) whether diluting the metric by including users who qualify for neither strategy is beneficial (Section 3.2), and (ii) whether dual control is a better setup for personalization strategy experiments (Section 3.3), two questions that are often discussed among e-commerce and marketing-focused experimenters. For brevity, we relegate most of the intermediate algebraic work in deriving the actual and minimum detectable effect sizes, as well as the conditions that lead to a setup being superior, to our supplementary document.²
⁵ The approximation in Equation (4) is tight for experiment design purposes, where $\alpha < 0.2$ and $1 - \beta > 0.6$ in nearly all cases.
⁶ If both actual effect sizes are negative, we simply swap the analysis groups. If the actual effect sizes are of opposite signs, it is likely an error.
3.1 Actual & minimum detectable effect sizes
We first present the actual effect size and MDE of the four experiment setups. For each setup we first compute the sample size, mean response, and response variance in each analysis group, which arises as a mixture of the user groups described in Section 2.1. For brevity, we only present the quantities for one analysis group per setup; expressions for the other analysis group(s) can easily be obtained by substituting in the corresponding user groups. We then substitute the computed quantities into the definitions of $\Delta$ (see Equation (2)) and $\theta^*$ (see Inequality (5)) to obtain the setup-specific actual effect size and MDE. We assume all random splits are done 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). The setup randomly splits user group 3 into two analysis groups, each with $n_3/2$ samples. Users in analysis group $A$ are provided treatment under strategy 1, and hence the group's responses have mean and variance $(\mu_{I_{3_1}}, \sigma^2_{I_{3_1}})$. The actual effect size and MDE for Setup 1 are hence:

$$\Delta_{S1} = \mu_{I_{3_2}} - \mu_{I_{3_1}}, \quad (7)$$
$$\theta^*_{S1} = (z_{1-\alpha/2} - z_{1-\pi_{\min}})\sqrt{2\big(\sigma^2_{I_{3_1}} + \sigma^2_{I_{3_2}}\big)/n_3}. \quad (8)$$
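Equations (7)-(8) translate directly into code. A sketch (names are ours; the default `z_tilde` $= z_{1-\alpha/2} - z_{1-\pi_{\min}}$ corresponds to $\alpha = 0.05$ and 80% minimum power):

```python
def setup1_effects(mu_i31, mu_i32, var_i31, var_i32, n3, z_tilde=2.8016):
    """Actual effect size (Eq. 7) and MDE (Eq. 8) for Setup 1.

    mu_i31/var_i31: mean/variance of group 3 treated under strategy 1;
    mu_i32/var_i32: ditto under strategy 2; n3: size of user group 3.
    """
    delta_s1 = mu_i32 - mu_i31
    theta_s1 = z_tilde * (2 * (var_i31 + var_i32) / n3) ** 0.5
    return delta_s1, theta_s1
```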
Setup 2 (All samples). This setup also contains two analysis groups, $A$ and $B$, each taking half of the population (i.e. $(n_0+n_1+n_2+n_3)/2$ users). The mean response and response variance for groups $A$ and $B$ are the weighted mean response and response variance of the constituent user groups respectively. As we only provide treatment to those who qualify for strategy 1 in group $A$, and likewise for group $B$ with strategy 2, each user group will give different responses; e.g. for group $A$:

$$\mu_A = \frac{n_0\mu_{C_0} + n_1\mu_{I_1} + n_2\mu_{C_2} + n_3\mu_{I_{3_1}}}{n_0+n_1+n_2+n_3}, \quad (9)$$
$$\sigma^2_A = \frac{n_0\sigma^2_{C_0} + n_1\sigma^2_{I_1} + n_2\sigma^2_{C_2} + n_3\sigma^2_{I_{3_1}}}{n_0+n_1+n_2+n_3}. \quad (10)$$

Substituting the above (and the analogues for group $B$) into the definitions of actual effect size and MDE, we have:

$$\Delta_{S2} = \frac{n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_{3_2}} - \mu_{I_{3_1}})}{n_0+n_1+n_2+n_3}, \quad (11)$$
$$\theta^*_{S2} = (z_{1-\alpha/2} - z_{1-\pi_{\min}})\sqrt{\frac{2\big(2n_0\sigma^2_{C_0} + n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_{3_1}} + \sigma^2_{I_{3_2}})\big)}{(n_0+n_1+n_2+n_3)^2}}. \quad (12)$$
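Equations (11)-(12) can likewise be sketched in code (dictionary keys and the function name are our illustrative conventions, mapping to the Table 1 combinations):

```python
def setup2_effects(mu, var, n, z_tilde=2.8016):
    """Actual effect size (Eq. 11) and MDE (Eq. 12) for Setup 2.

    mu/var: dicts keyed by Table 1 combinations ("C0", "I1", "I31", ...);
    n: list of the four user group sizes [n0, n1, n2, n3];
    z_tilde defaults to alpha = 0.05, 80% minimum power.
    """
    total = sum(n)
    delta_s2 = (n[1] * (mu["C1"] - mu["I1"])
                + n[2] * (mu["I2"] - mu["C2"])
                + n[3] * (mu["I32"] - mu["I31"])) / total
    pooled = (2 * n[0] * var["C0"]
              + n[1] * (var["I1"] + var["C1"])
              + n[2] * (var["C2"] + var["I2"])
              + n[3] * (var["I31"] + var["I32"]))
    theta_s2 = z_tilde * (2 * pooled / total ** 2) ** 0.5
    return delta_s2, theta_s2
```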
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded. This leads to both analysis groups having $(n_1+n_2+n_3)/2$ users. The absence of group 0 users means they do not feature in the weighted mean response and response variance of the two analysis groups; e.g. for group $A$:

$$\mu_A = \frac{n_1\mu_{I_1} + n_2\mu_{C_2} + n_3\mu_{I_{3_1}}}{n_1+n_2+n_3}, \qquad \sigma^2_A = \frac{n_1\sigma^2_{I_1} + n_2\sigma^2_{C_2} + n_3\sigma^2_{I_{3_1}}}{n_1+n_2+n_3}. \quad (13)$$

This leads to the following actual effect size and MDE for Setup 3:

$$\Delta_{S3} = \frac{n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_{3_2}} - \mu_{I_{3_1}})}{n_1+n_2+n_3}, \quad (14)$$
$$\theta^*_{S3} = (z_{1-\alpha/2} - z_{1-\pi_{\min}})\sqrt{\frac{2\big(n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_{3_1}} + \sigma^2_{I_{3_2}})\big)}{(n_1+n_2+n_3)^2}}. \quad (15)$$
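A companion sketch for Equations (14)-(15); the group-0 terms simply drop out (dictionary keys and naming are our illustrative conventions):

```python
def setup3_effects(mu, var, n, z_tilde=2.8016):
    """Actual effect size (Eq. 14) and MDE (Eq. 15) for Setup 3.

    mu/var: dicts keyed by Table 1 combinations ("C1", "I1", "I31", ...);
    n: list of the four user group sizes [n0, n1, n2, n3] (n[0] unused
    since group 0 is excluded from the analysis).
    """
    qualified = n[1] + n[2] + n[3]
    delta_s3 = (n[1] * (mu["C1"] - mu["I1"])
                + n[2] * (mu["I2"] - mu["C2"])
                + n[3] * (mu["I32"] - mu["I31"])) / qualified
    pooled = (n[1] * (var["I1"] + var["C1"])
              + n[2] * (var["C2"] + var["I2"])
              + n[3] * (var["I31"] + var["I32"]))
    theta_s3 = z_tilde * (2 * pooled / qualified ** 2) ** 0.5
    return delta_s3, theta_s3
```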
Setup 4 (Dual control). The setup is the odd one out as it has four analysis groups. Two of the analysis groups ($A_1$ and $A_2$) are drawn from those who qualify for strategy 1 and are allocated to the first randomization group, and the other two ($B_1$ and $B_2$) are drawn from those who qualify for strategy 2 and are allocated to the second randomization group:

$$n_{A_1} = n_{A_2} = (n_1+n_3)/4, \qquad n_{B_1} = n_{B_2} = (n_2+n_3)/4. \quad (16)$$

The mean response and response variance for group $A_1$ are:

$$\mu_{A_1} = \frac{n_1\mu_{C_1} + n_3\mu_{C_3}}{n_1+n_3}, \qquad \sigma^2_{A_1} = \frac{n_1\sigma^2_{C_1} + n_3\sigma^2_{C_3}}{n_1+n_3}. \quad (17)$$

As the setup takes the difference of differences in the metric (i.e. the difference between the mean response for groups $B_2$ and $B_1$, and the difference between the mean response for groups $A_2$ and $A_1$),⁷ the actual effect size is as follows:

$$\Delta_{S4} = (\mu_{B_2} - \mu_{B_1}) - (\mu_{A_2} - \mu_{A_1}) = \frac{n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_{3_2}} - \mu_{C_3})}{n_2+n_3} - \frac{n_1(\mu_{I_1} - \mu_{C_1}) + n_3(\mu_{I_{3_1}} - \mu_{C_3})}{n_1+n_3}. \quad (18)$$

The MDE for Setup 4 is similar to that specified in the RHS of Inequality (5), albeit with more groups:

$$\theta^*_{S4} = (z_{1-\alpha/2} - z_{1-\pi_{\min}})\sqrt{\sigma^2_{A_1}/n_{A_1} + \sigma^2_{A_2}/n_{A_2} + \sigma^2_{B_1}/n_{B_1} + \sigma^2_{B_2}/n_{B_2}}$$
$$= 2\,(z_{1-\alpha/2} - z_{1-\pi_{\min}})\sqrt{\frac{n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_3(\sigma^2_{C_3} + \sigma^2_{I_{3_1}})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{C_3} + \sigma^2_{I_{3_2}})}{(n_2+n_3)^2}}. \quad (19)$$
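Equations (18)-(19) for the dual control setup can be sketched the same way (dictionary keys and naming are our illustrative conventions):

```python
def setup4_effects(mu, var, n, z_tilde=2.8016):
    """Actual effect size (Eq. 18) and MDE (Eq. 19) for Setup 4 (dual control).

    mu/var: dicts keyed by Table 1 combinations ("C1", "C3", "I31", ...);
    n: list of the four user group sizes [n0, n1, n2, n3].
    """
    # Incrementality of each strategy, measured within its randomization group
    inc_b = (n[2] * (mu["I2"] - mu["C2"])
             + n[3] * (mu["I32"] - mu["C3"])) / (n[2] + n[3])
    inc_a = (n[1] * (mu["I1"] - mu["C1"])
             + n[3] * (mu["I31"] - mu["C3"])) / (n[1] + n[3])
    delta_s4 = inc_b - inc_a
    term_a = (n[1] * (var["C1"] + var["I1"])
              + n[3] * (var["C3"] + var["I31"])) / (n[1] + n[3]) ** 2
    term_b = (n[2] * (var["C2"] + var["I2"])
              + n[3] * (var["C3"] + var["I32"])) / (n[2] + n[3]) ** 2
    theta_s4 = 2 * z_tilde * (term_a + term_b) ** 0.5
    return delta_s4, theta_s4
```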
3.2 Is dilution always bad?
The use of responses from users who do not qualify for any of the strategies being compared, an act known as metric dilution, has stirred countless debates in experimentation teams. On one hand, responses from these users make any treatment effect less pronounced by contributing exactly zero; on the other hand, dilution might be necessary, as one may not know which users actually qualify for a strategy [9], or desirable, as these users can be leveraged to reduce the variance of the treatment effect estimator [3].
Here, we are interested in whether we should engage in the act of dilution given the assumed user responses prior to an experiment. This can be clarified by understanding the conditions under which Setup 3 emerges superior (as defined in Section 2.3) to Setup 2. By inspecting Equations (11) and (14), it is clear that $\Delta_{S3} > \Delta_{S2}$ if $n_0 > 0$. Thus, Setup 3 is superior to Setup 2 under the first criterion if $\theta^*_{S3} < \theta^*_{S2}$, which is the case if $\sigma^2_{C_0}$, the metric variance of users who qualify for neither strategy, is large. This can be shown by substituting Equations (12) and (15) into the $\theta^*$-inequality and rearranging the terms to obtain:

$$\big(n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_{3_1}} + \sigma^2_{I_{3_2}})\big) \cdot \frac{n_0 + 2n_1 + 2n_2 + 2n_3}{2(n_1+n_2+n_3)^2} < \sigma^2_{C_0}. \quad (20)$$

⁷ Not to be confused with the difference-in-differences method, which captures the metric for the control and treatment groups at both the beginning (pre-intervention) and the end (post-intervention) of the experiment. Here we only capture the metric once, at the end of the experiment.
If we assume the response variances are similar across groups whose users qualify for at least one strategy, i.e. $\sigma^2_{I_1} \approx \sigma^2_{C_1} \approx \cdots \approx \sigma^2_{I_{3_2}} \approx \sigma^2_q$, Inequality (20) can be simplified to

$$\sigma^2_q \left(\frac{n_0}{n_1+n_2+n_3} + 2\right) < \sigma^2_{C_0}, \quad (21)$$

which can be used to quickly determine whether one should consider dilution at all.
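Inequality (21) amounts to a one-line screening test; a sketch (function name is ours):

```python
def dilution_may_help(var_q, var_c0, n0, n_qualified):
    """Complement of Inequality (21): True when including group-0 users
    could improve sensitivity (i.e. Setup 3 is NOT clearly superior).

    var_q: (approximately) shared response variance of the qualified
    group-scenario combinations; var_c0: response variance of group 0;
    n_qualified = n1 + n2 + n3.
    """
    return var_q * (n0 / n_qualified + 2) >= var_c0
```

For instance, with a noisy group 0 (`var_c0` ten times `var_q`) and `n0` roughly three times the qualified population, Inequality (21) holds and dilution is ruled out.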
If Inequality (20) does not hold (i.e. $\theta^*_{S3} \geq \theta^*_{S2}$), we should then consider when the second criterion (i.e. $\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}$) is met. Writing

$$\eta = n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_{3_2}} - \mu_{I_{3_1}}),$$
$$\xi = n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_2(\sigma^2_{I_2} + \sigma^2_{C_2}) + n_3(\sigma^2_{I_{3_1}} + \sigma^2_{I_{3_2}}), \text{ and}$$
$$\tilde{z} = z_{1-\alpha/2} - z_{1-\pi_{\min}},$$

we can substitute Equations (11), (12), (14), and (15) into the inequality and rearrange to obtain

$$\frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C_0} + \xi} > \frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,\tilde{z}}. \quad (22)$$
As the LHS of Inequality (22) is always positive, Setup 3 is superior if the RHS $\leq 0$. Noting that

$$\Delta_{S3} = \eta/(n_1+n_2+n_3) \quad \text{and} \quad \theta^*_{S3} = \sqrt{2}\cdot\tilde{z}\cdot\sqrt{\xi}/(n_1+n_2+n_3),$$

this trivial case is satisfied if

$$\frac{n_0+n_1+n_2+n_3}{n_0}\cdot\theta^*_{S3} \leq \Delta_{S3}. \quad (23)$$
If the RHS of Inequality (22) is positive, we can safely square both sides and use the identities for $\Delta_{S3}$ and $\theta^*_{S3}$ to get

$$\frac{2\sigma^2_{C_0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2\tilde{z}^2}. \quad (24)$$

As the LHS is always positive, the second criterion is met if

$$\theta^*_{S3} \leq \Delta_{S3}. \quad (25)$$
Note this is a weaker, and thus more easily satisfiable, condition than that introduced in Inequality (23). This suggests an experiment setup is always superior to a diluted alternative if the experiment is already adequately powered; introducing any dilution will simply make things worse.
Failing the condition in Inequality (25), we can always fall back to Inequality (24). While the inequality operates in squared space, it essentially compares the standard error of user group 0 (LHS), those who qualify for neither strategy, against the gap between the minimum detectable and actual effects ($\theta^*_{S3} - \Delta_{S3}$). The gap can be interpreted as the existing noise level: a higher standard error means mixing in group 0 users will introduce extra noise, and one is better off without them; conversely, a smaller standard error means group 0 users can lower the noise level, i.e. stabilize the metric fluctuation, and one should take advantage of them.
To summarize, diluting a personalization strategy experiment setup is not helpful if
(i) users who do not qualify for any strategy have a large metric variance (Inequality (20)), or
(ii) the experiment is already adequately powered (Inequality (25)).
It could help if the experiment has not yet gained sufficient power and users who do not qualify for any strategy provide low-variance responses, such that they exhibit a stabilizing effect when included in the analysis (complement of Inequality (24)).
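The summary can be wired into a single pre-experiment check. The sketch below implements Inequalities (25) and (24); it is ours, not the paper's code, and it assumes the RHS of Inequality (22) is positive whenever Inequality (25) fails:

```python
def dilution_worthwhile(delta_s3, theta_s3, var_c0, n0, n_qualified,
                        z_tilde=2.8016):
    """True when mixing in group-0 users is expected to help (Section 3.2).

    delta_s3/theta_s3: actual effect and MDE of the undiluted Setup 3;
    var_c0: response variance of group 0; n_qualified = n1 + n2 + n3.
    """
    if theta_s3 <= delta_s3:
        return False  # Inequality (25): already adequately powered
    r = n_qualified / n0
    # RHS of Inequality (24); dilution helps only in its complement.
    rhs = ((theta_s3 - delta_s3 + r * theta_s3) ** 2
           - (r * theta_s3) ** 2) / (2 * z_tilde ** 2)
    return 2 * var_c0 / n0 <= rhs
```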
3.3 When is a dual control more effective?
Often when advertisers compare two personalization strategies, the question of whether to use a dual control/multi-cell design comes up. Proponents of such an approach celebrate its ability to tell a story by making the incrementality of an individual strategy available, while opponents voice concerns about the complexity of setting up the design. Here we are interested in whether Setup 4 (dual control) is superior to Setup 3 (a simple A/B test) from a power/detectable effect perspective, and if so, under what circumstances.
We first observe that $\theta^*_{S4} > \theta^*_{S3}$ is always true, and hence a dual control setup will never be superior to a simpler setup under the first criterion. This can be verified by substituting in Equations (19) and (15) and rearranging the terms to show the inequality is equivalent to

$$2\left(\frac{n_1(\sigma^2_{C_1} + \sigma^2_{I_1})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C_2} + \sigma^2_{I_2})}{(n_2+n_3)^2} + \frac{n_3\,\sigma^2_{I_{3_1}}}{(n_1+n_3)^2} + \frac{n_3\,\sigma^2_{I_{3_2}}}{(n_2+n_3)^2} + \left(\frac{n_3}{(n_1+n_3)^2} + \frac{n_3}{(n_2+n_3)^2}\right)\sigma^2_{C_3}\right)$$
$$> \frac{n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3\,\sigma^2_{I_{3_1}} + n_3\,\sigma^2_{I_{3_2}}}{(n_1+n_2+n_3)^2}, \quad (26)$$

which is always true given the $n$'s are non-negative and the $\sigma^2$'s are positive: not only are the coefficients of the $\sigma^2$-terms larger on the LHS than their RHS counterparts, the LHS also carries an extra $\sigma^2_{C_3}$ term with a non-negative coefficient, plus an overall factor of two.
Moving on to the second evaluation criterion, we recall that Setup 4 is superior if $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$; otherwise Setup 3 is superior under the same criterion. The full flexibility of the model can be seen by substituting Equations (14), (15), (18), and (19) into the inequality and rearranging to obtain

$$\frac{n_1\frac{n_2(\mu_{I_2}-\mu_{C_2}) + n_3(\mu_{I_{3_2}}-\mu_{C_3})}{n_2+n_3} - n_2\frac{n_1(\mu_{I_1}-\mu_{C_1}) + n_3(\mu_{I_{3_1}}-\mu_{C_3})}{n_1+n_3}}{\sqrt{n_1(\sigma^2_{C_1}+\sigma^2_{I_1}) + n_2(\sigma^2_{C_2}+\sigma^2_{I_2}) + n_3(\sigma^2_{I_{3_1}}+\sigma^2_{I_{3_2}})}}$$
$$> \sqrt{2}\,\tilde{z}\left[\sqrt{2\cdot\frac{\big(1+\frac{n_2}{n_1+n_3}\big)^2\big[n_1(\sigma^2_{C_1}+\sigma^2_{I_1}) + n_3(\sigma^2_{C_3}+\sigma^2_{I_{3_1}})\big] + \big(1+\frac{n_1}{n_2+n_3}\big)^2\big[n_2(\sigma^2_{C_2}+\sigma^2_{I_2}) + n_3(\sigma^2_{C_3}+\sigma^2_{I_{3_2}})\big]}{n_1(\sigma^2_{C_1}+\sigma^2_{I_1}) + n_2(\sigma^2_{C_2}+\sigma^2_{I_2}) + n_3(\sigma^2_{I_{3_1}}+\sigma^2_{I_{3_2}})}} - 1\right], \quad (27)$$

where $\tilde{z} = z_{1-\alpha/2} - z_{1-\pi_{\min}}$.
A key observation from inspecting Inequality (27) is that the LHS of the inequality scales as $O(\sqrt{n})$, where $n$ is the number of users, while the RHS remains a constant. This leads to the insight that Setup 4 is more likely to be superior if the $n$'s are large. Here we assume the ratio $n_1 : n_2 : n_3$ remains unchanged when we scale the number of samples, an assumption that generally holds when an organization increases its reach while maintaining its user mix. It is worth pointing out that our claim is stronger than that in previous work: we have shown that having a large user base not only fulfills the requirement for running a dual control experiment described in [9], it also makes a dual control experiment a better setup than its simpler counterparts in terms of apparent and detectable effect sizes.
The scaling relationship can be seen more clearly if we apply some simplifying assumptions to the $\sigma^2$- and $n$-terms. If we assume the response variances are similar across user groups (i.e. $\sigma^2_{C_1} \approx \sigma^2_{I_1} \approx \cdots \approx \sigma^2_{I_{3_2}} \approx \sigma^2_q$), the RHS of Inequality (27) becomes

$$\sqrt{2}\,\tilde{z}\left[\sqrt{\frac{2(n_1+n_2+n_3)}{n_1+n_3} + \frac{2(n_1+n_2+n_3)}{n_2+n_3}} - 1\right], \quad (28)$$

which remains a constant if the ratio $n_1 : n_2 : n_3$ remains unchanged. If we assume the numbers of users in groups 1, 2, and 3 are similar (i.e. $n_1 = n_2 = n_3 = n$), the LHS of Inequality (27) becomes

$$\frac{\sqrt{n}\,\big((\mu_{I_2} - \mu_{C_2}) - (\mu_{I_1} - \mu_{C_1}) + \mu_{I_{3_2}} - \mu_{I_{3_1}}\big)}{2\sqrt{\sigma^2_{C_1} + \sigma^2_{I_1} + \sigma^2_{C_2} + \sigma^2_{I_2} + \sigma^2_{I_{3_1}} + \sigma^2_{I_{3_2}}}}, \quad (29)$$

which clearly scales as $O(\sqrt{n})$.
We conclude the section by providing an indication of what a large $n$ may look like. If we assume both the response variances and the numbers of users are similar across user groups, we can rearrange Inequality (27) to make $n$ the subject:

$$n > \left(2\sqrt{12}\,(\sqrt{6} - 1)\,\tilde{z}\right)^2 \frac{\sigma^2_q}{\Delta^2}, \quad (30)$$

where $\Delta = (\mu_{I_2} - \mu_{C_2}) - (\mu_{I_1} - \mu_{C_1}) + \mu_{I_{3_2}} - \mu_{I_{3_1}}$ is the difference in actual effect sizes between Setups 4 and 3. Under a 5% significance level and 80% power, the first coefficient amounts to around 791, which is roughly 50 times the coefficient one would use to determine the sample size of a simple A/B test [11]. This suggests a dual control setup is perhaps a luxury accessible only to the largest advertising platforms and their top advertisers. For example, consider an experiment to optimize conversion rate where the baselines attain 20% (hence having a variance of $0.2(1-0.2) = 0.16$). If there is a 2.5% relative (i.e. 0.5% absolute) effect between the competing strategies, the dual control setup will only be superior if there are more than 5M users in each user group.
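The back-of-envelope numbers above (the ~791 coefficient and the 5M-user requirement) can be reproduced directly:

```python
from statistics import NormalDist

# z-tilde for a 5% significance level and 80% minimum power
z_tilde = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

# First coefficient of Inequality (30): approximately 791
coeff = (2 * 12 ** 0.5 * (6 ** 0.5 - 1) * z_tilde) ** 2

# Worked example from the text: 20% conversion baseline (Bernoulli
# variance 0.16) and a 0.5% absolute effect between strategies
var_q = 0.2 * (1 - 0.2)
delta = 0.005
n_required = coeff * var_q / delta ** 2  # exceeds 5 million users per group
```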
4 EXPERIMENTS
Having performed theoretical calculations of the actual and detectable effects and the conditions under which one experiment setup is superior to another, we now verify those calculations using simulation. We focus on the results presented in Section 3.1, as the rest of the results follow from those calculations.
In each experiment setup evaluation, we randomly select the values of the parameters (i.e. the $\mu$'s, $\sigma^2$'s, and $n$'s), and take 1,000 actual effect samples, each by (i) sampling the responses from the user groups under the specified parameters, (ii) computing the mean for the analysis groups, and (iii) taking the difference of the means.
          Actual effect size     Minimum detectable effect
Setup 1   1049/1099 (95.45%)     66/81 (81.48%)
Setup 2   853/999 (85.38%)       87/106 (82.08%)
Setup 3   922/1099 (83.89%)      93/116 (80.18%)
Setup 4   240/333 (72.07%)       149/185 (80.54%)

Table 2: Number of evaluations where the theoretical value of the quantities (columns) falls within the 95% bootstrap confidence interval for each experiment setup (rows). See Section 4 for a detailed description of the evaluations.
We also take 100 MDE samples in separate evaluations, each by (i) sampling a critical value under the null hypothesis; (ii) computing the test power under a large number of possible effect sizes, each using the critical value and the sampled metric means under the alternative hypothesis; and (iii) searching the effect size space for the value that gives the predefined power. As the power vs. effect size curve is noisy given the use of simulated power samples, we use the bisection algorithm provided by the noisyopt package to perform the search. The algorithm dynamically adjusts the number of samples taken from the same point on the curve to ensure the noise does not send us down the wrong part of the search space.
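The power computation in step (ii) can be approximated by direct Monte Carlo simulation of the test statistic. A minimal sketch (ours, not the paper's actual implementation):

```python
import random
from statistics import NormalDist

def simulated_power(theta, sigma_d_hat, alpha=0.05, n_sims=10000, seed=0):
    """Estimate test power at effect size theta by simulating
    D-hat ~ N(theta, sigma_d_hat^2) and counting rejections of H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = sum(
        abs(rng.gauss(theta, sigma_d_hat)) / sigma_d_hat > z_crit
        for _ in range(n_sims)
    )
    return rejections / n_sims
```

Searching this noisy power curve for the effect size giving, say, 80% power then calls for a noise-aware root finder such as the bisection routine in noisyopt.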
We expect the mean of the sampled actual effect and MDE to match the theoretical value. To verify this, we perform 1,000 bootstrap resamplings on the samples obtained above to obtain an empirical bootstrap distribution of the sample mean in each evaluation. The 95% bootstrap resampling confidence interval (BRCI) should then contain the theoretical mean 95% of the time. The histogram of the percentile rank of the theoretical quantity in relation to the bootstrap samples across multiple evaluations should also follow a uniform distribution [13].
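The check described above can be sketched on a toy evaluation where the theoretical actual effect is known (illustrative parameters; names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_rci(samples, n_boot=1000, level=0.95):
    """Bootstrap resampling confidence interval (BRCI) for the sample mean."""
    means = np.array([rng.choice(samples, size=samples.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Toy evaluation: the "actual effect" is a difference of two analysis group
# means whose theoretical value is mu_B - mu_A = 0.5.
mu_a, mu_b, n = 1.0, 1.5, 200
effects = np.array([rng.normal(mu_b, 1, n).mean() - rng.normal(mu_a, 1, n).mean()
                    for _ in range(1000)])
lo, hi = bootstrap_rci(effects)
print(lo, hi)  # the theoretical value 0.5 should usually fall within the BRCI
```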
The result is shown in Table 2. One can observe that there are more evaluations having their theoretical quantity lying outside the BRCI than expected. Upon further investigation, we observed a characteristic ∪-shape in the histograms of the percentile ranks for the actual effects. This suggests the bootstrap samples may be under-dispersed but otherwise centered on the theoretical quantities.
We also observed the histograms for MDEs curving upward to the right, which suggests the theoretical value is a slight overestimate (of < 1% relative to the bootstrap mean in all cases). We believe this is likely due to a small bias in the bisection algorithm. The algorithm tests whether the mean of the power samples is less than the target power to decide which half of the search space to continue along. Given we can bisect up to 10 times in each evaluation, a false positive is likely even when we set the significance level for individual comparisons to 1%. This leads to the algorithm favoring a smaller MDE sample. That said, since we have tested a wide range of parameters and the overall bias is small, we are satisfied with the theoretical quantities for experiment design purposes.
An Evaluation Framework for Personalization Strategy Experiment Designs. AdKDD '20, Aug 2020, Online.

5 CONCLUSION
We have addressed the problem of comparing experiment designs for personalization strategies by presenting an evaluation framework that allows experimenters to evaluate which experiment setup should be adopted given their situation. The flexible framework can easily be extended to compare setups involving more than two strategies by adding more user groups (i.e. new sets to the Venn diagram in Figure 1). A new setup can also be incorporated quickly, as a setup is essentially a different weighting of the user group-scenario combinations shown in Table 1. The framework also enables simple rules of thumb such as:
(i) Metric dilution should never be employed if the experiment
already has sufficient power; though it can be useful if theexperiment is under-powered and the non-qualifying usersprovide a โstabilizing effectโ; and
(ii) A dual control setup is superior to simpler setups only if onehas access to the user base of the largest organizations.
We have validated the theoretical results via simulations, and made the code available² so that practitioners can benefit from the results immediately when designing their upcoming experiments.
Future Work. So far we have assumed the responses from each user group-scenario combination are randomly distributed with the mean independent of the variance, with the evaluation criteria calculated assuming the metric (being the weighted sample mean of the responses) is approximately normally distributed under the Central Limit Theorem. We are interested in whether the evaluation framework and its results still hold if we deviate from these assumptions, e.g. with binary responses (where the mean and variance are linked) and heavy-tailed responses (where the sample mean converges to a normal distribution slowly).
ACKNOWLEDGMENTS
The work is partially funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at Imperial and Oxford (StatML.IO) and ASOS.com. The authors thank the anonymous reviewers for providing many improvements to the original manuscript.
REFERENCES
[1] Brooke Bengier and Amanda Knupp. [n.d.]. Selling More Stuff: The What, Why & How of Incrementality Testing. https://www.quantcast.com/blog/selling-more-stuff-the-what-why-how-of-incrementality-testing/. Blog post.
[2] Will Browne and Mike Swarbrick Jones. 2017. What Works in e-commerce - a Meta-analysis of 6700 Online Experiments. http://www.qubit.com/wp-content/uploads/2017/12/qubit-research-meta-analysis.pdf. White paper.
[3] Alex Deng and Victor Hu. 2015. Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (Shanghai, China) (WSDM '15). ACM, New York, NY, USA, 349–358. https://doi.org/10.1145/2684822.2685307
[4] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-experiment Data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (Rome, Italy) (WSDM '13). ACM, New York, NY, USA, 123–132. https://doi.org/10.1145/2433396.2433413
[5] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1427–1436.
[6] Henning Hohnhold, Deirdre O'Brien, and Diane Tang. 2015. Focusing on the Long-term: It's Good for Users and Business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 1849–1858.
[7] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B Tests: Why It Matters, and What to Do About It. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1517–1525.
[8] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, Illinois, USA) (KDD '13). ACM, New York, NY, USA, 1168–1176.
[9] C. H. Bryan Liu, Elaine M. Bettaney, and Benjamin Paul Chamberlain. 2018. Designing Experiments to Measure Incrementality on Facebook. arXiv:1806.02588 [stat.ME]. 2018 AdKDD & TargetAd Workshop.
[10] C. H. Bryan Liu, Benjamin Paul Chamberlain, and Emma J. McCoy. 2020. What is the Value of Experimentation and Measurement?: Quantifying the Value and Risk of Reducing Uncertainty to Make Better Decisions. Data Science and Engineering 5, 2 (2020), 152–167. https://doi.org/10.1007/s41019-020-00121-5
[11] Evan Miller. 2010. How Not To Run an A/B Test. https://www.evanmiller.org/how-not-to-run-an-ab-test.html. Blog post.
[12] Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). ACM, New York, NY, USA, 235–244. https://doi.org/10.1145/2939672.2939688
[13] Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv:1804.06788 [stat.ME]
[14] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 2227–2236.
SUPPLEMENTARY DOCUMENT
In this supplementary document, we revisit Section 3 of the paper "An Evaluation Framework for Personalization Strategy Experiment Designs" by Liu and McCoy (2020), and expand on the intermediate algebraic work when deriving:
(i) the actual & minimum detectable effect sizes, and
(ii) the conditions that lead to a setup being superior.
We use the equation numbers featured in the original paper, and append letters for intermediate steps.
A EFFECT SIZE OF EXPERIMENT SETUPS
We begin with the actual effect size and the MDE of the four experiment setups featured in Section 3.1 of the paper. For each setup we first compute the sample size, mean response, and response variance in each analysis group (denoted $n_i$, $\mu_i$, and $\sigma^2_i$ respectively for analysis group $i$). These quantities arise as a mixture of the user groups described in Section 2.1 of the paper. We then substitute the computed quantities into the definitions of $\Delta$ and $\theta^*$:
$$\Delta \triangleq \mu_B - \mu_A, \tag{see (2)}$$
$$\theta^* \triangleq (z_{1-\alpha/2} - z_{1-\beta_{\min}})\,\sigma_{\hat\Delta}, \quad \text{where } \sigma_{\hat\Delta} = \sqrt{\sigma^2_A/n_A + \sigma^2_B/n_B}, \tag{see (5)}$$
and $z_q$ denotes the $q$th quantile of the standard normal distribution, to obtain the setup-specific actual effect size and MDE for a setup with two analysis groups. For setups with more than two analysis groups, we specify the actual and minimum detectable effect when we discuss each setup. We assume all random splits are 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). We recall that the setup, which considers only users who qualify for both personalization strategies (i.e. the intersection), randomly splits user group 3 into two analysis groups, $A$ and $B$, each with the following number of samples:
$$n_A = n_B = \frac{n_3}{2}.$$
Users in analysis group $A$ receive treatment under strategy 1, and users in analysis group $B$ receive treatment under strategy 2. This leads to the groups' responses having the following metric mean and variance:
$$\mu_A = \mu_{Ia}, \quad \mu_B = \mu_{Ib}, \quad \sigma^2_A = \sigma^2_{Ia}, \quad \sigma^2_B = \sigma^2_{Ib}.$$
The actual effect size and MDE for Setup 1 are hence:
$$\Delta_{S1} = \mu_{Ib} - \mu_{Ia}, \tag{7}$$
$$\theta^*_{S1} = (z_{1-\alpha/2} - z_{1-\beta_{\min}}) \sqrt{\frac{\sigma^2_{Ia}}{n_3/2} + \frac{\sigma^2_{Ib}}{n_3/2}}. \tag{8}$$
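Equations (7) and (8) translate directly into code. A minimal sketch, with illustrative parameter values (all names are ours):

```python
from statistics import NormalDist

def mde(var_a, n_a, var_b, n_b, alpha=0.05, power=0.8):
    # theta* = (z_{1-alpha/2} - z_{1-beta_min}) * sqrt(var_A/n_A + var_B/n_B)
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    return z * (var_a / n_a + var_b / n_b) ** 0.5

# Setup 1: both analysis groups are halves of user group 3.
n3 = 10_000
mu_ia, mu_ib = 1.00, 1.05        # responses under strategies 1 and 2
var_ia, var_ib = 0.5, 0.6
delta_s1 = mu_ib - mu_ia                          # Equation (7)
theta_s1 = mde(var_ia, n3 / 2, var_ib, n3 / 2)    # Equation (8)
# Here Delta_S1 exceeds theta*_S1, so this experiment is adequately powered.
print(delta_s1, theta_s1)
```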
Setup 2 (All samples). The setup, which considers all users regardless of whether they qualify for any strategy, also contains two analysis groups, $A$ and $B$, each taking half of the population:
$$n_A = n_B = \frac{n_0 + n_1 + n_2 + n_3}{2}.$$
The mean response and response variance for groups $A$ and $B$ are the weighted mean response and response variance of the constituent user groups respectively, weighted by the constituent groups' sizes. As we only provide treatment to those who qualify for strategy 1 in group $A$, and to those who qualify for strategy 2 in group $B$, different constituent user groups respond differently:
$$\mu_A = \frac{n_0\mu_{C0} + n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{Ia}}{n_0+n_1+n_2+n_3}, \quad \mu_B = \frac{n_0\mu_{C0} + n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{Ib}}{n_0+n_1+n_2+n_3}; \tag{9}$$
$$\sigma^2_A = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{Ia}}{n_0+n_1+n_2+n_3}, \quad \sigma^2_B = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{Ib}}{n_0+n_1+n_2+n_3}. \tag{10}$$
Substituting the above into the definitions of the actual effect size and MDE and simplifying the resultant expressions, we have:
$$\Delta_{S2} = \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{Ia})}{n_0+n_1+n_2+n_3}, \tag{11}$$
$$\theta^*_{S2} = (z_{1-\alpha/2} - z_{1-\beta_{\min}}) \sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})\big)}{(n_0+n_1+n_2+n_3)^2}}. \tag{12}$$
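As a sanity check, the closed forms in Equations (11) and (12) can be compared against a first-principles computation on the mixture quantities (illustrative numbers; names are ours):

```python
from statistics import NormalDist

# Illustrative sizes, means and variances for user groups 0-3.
n = [4000, 3000, 2000, 1000]
mu_c0, mu_i1, mu_c1, mu_i2, mu_c2, mu_ia, mu_ib = 1.0, 1.2, 1.1, 1.3, 1.15, 1.25, 1.35
v_c0, v_i1, v_c1, v_i2, v_c2, v_ia, v_ib = 0.5, 0.6, 0.55, 0.65, 0.5, 0.7, 0.75
N = sum(n)
z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

# First principles: weighted mixture means of the two analysis groups.
mu_a = (n[0] * mu_c0 + n[1] * mu_i1 + n[2] * mu_c2 + n[3] * mu_ia) / N
mu_b = (n[0] * mu_c0 + n[1] * mu_c1 + n[2] * mu_i2 + n[3] * mu_ib) / N
# Closed form, Equation (11).
delta_s2 = (n[1] * (mu_c1 - mu_i1) + n[2] * (mu_i2 - mu_c2)
            + n[3] * (mu_ib - mu_ia)) / N
print(abs((mu_b - mu_a) - delta_s2) < 1e-9)

# Closed form, Equation (12) ...
theta_s2 = z * (2 * (n[0] * 2 * v_c0 + n[1] * (v_i1 + v_c1)
                     + n[2] * (v_c2 + v_i2) + n[3] * (v_ia + v_ib)) / N ** 2) ** 0.5
# ... versus the first-principles form z * sqrt(var_A/(N/2) + var_B/(N/2)).
var_a = (n[0] * v_c0 + n[1] * v_i1 + n[2] * v_c2 + n[3] * v_ia) / N
var_b = (n[0] * v_c0 + n[1] * v_c1 + n[2] * v_i2 + n[3] * v_ib) / N
print(abs(theta_s2 - z * (var_a / (N / 2) + var_b / (N / 2)) ** 0.5) < 1e-9)
```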
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded:
$$n_A = n_B = \frac{n_1 + n_2 + n_3}{2}.$$
The absence of members from user group 0 means they are not featured in the weighted mean response and response variance of the two analysis groups:
$$\mu_A = \frac{n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{Ia}}{n_1+n_2+n_3}, \quad \mu_B = \frac{n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{Ib}}{n_1+n_2+n_3};$$
$$\sigma^2_A = \frac{n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{Ia}}{n_1+n_2+n_3}, \quad \sigma^2_B = \frac{n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{Ib}}{n_1+n_2+n_3}. \tag{13}$$
This leads to the following actual effect size and MDE for Setup 3:
$$\Delta_{S3} = \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{Ia})}{n_1+n_2+n_3}, \tag{14}$$
$$\theta^*_{S3} = (z_{1-\alpha/2} - z_{1-\beta_{\min}}) \sqrt{\frac{2\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})\big)}{(n_1+n_2+n_3)^2}}. \tag{15}$$
Setup 4 (Dual control). Setup 4 is unique amongst the experiment setups introduced as it has four analysis groups. Two of the analysis groups are drawn from those who qualify for strategy 1 and are allocated to the first half of the population, and the other two are drawn from those who qualify for strategy 2 and are allocated to the second half:
$$n_{A1} = n_{A2} = \frac{n_1 + n_3}{4}, \qquad n_{B1} = n_{B2} = \frac{n_2 + n_3}{4}. \tag{16}$$
The mean response and response variance of each analysis group are the weighted mean response and response variance of the user groups involved respectively:
$$\mu_{A1} = \frac{n_1\mu_{C1} + n_3\mu_{C3}}{n_1+n_3}, \quad \mu_{A2} = \frac{n_1\mu_{I1} + n_3\mu_{Ia}}{n_1+n_3},$$
$$\mu_{B1} = \frac{n_2\mu_{C2} + n_3\mu_{C3}}{n_2+n_3}, \quad \mu_{B2} = \frac{n_2\mu_{I2} + n_3\mu_{Ib}}{n_2+n_3};$$
$$\sigma^2_{A1} = \frac{n_1\sigma^2_{C1} + n_3\sigma^2_{C3}}{n_1+n_3}, \quad \sigma^2_{A2} = \frac{n_1\sigma^2_{I1} + n_3\sigma^2_{Ia}}{n_1+n_3},$$
$$\sigma^2_{B1} = \frac{n_2\sigma^2_{C2} + n_3\sigma^2_{C3}}{n_2+n_3}, \quad \sigma^2_{B2} = \frac{n_2\sigma^2_{I2} + n_3\sigma^2_{Ib}}{n_2+n_3}. \tag{17}$$
As the setup takes the difference of differences (i.e. the difference between groups $B2$ and $B1$, minus the difference between groups $A2$ and $A1$), the actual effect size is specified, post-simplification, as follows:
$$\Delta_{S4} = (\mu_{B2} - \mu_{B1}) - (\mu_{A2} - \mu_{A1}) = \frac{n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})}{n_2+n_3} - \frac{n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})}{n_1+n_3}. \tag{18}$$
The MDE for Setup 4 is similar to that specified on the RHS of Inequality (5), albeit with more groups:
$$\theta^*_{S4} = (z_{1-\alpha/2} - z_{1-\beta_{\min}}) \sqrt{\sigma^2_{A1}/n_{A1} + \sigma^2_{A2}/n_{A2} + \sigma^2_{B1}/n_{B1} + \sigma^2_{B2}/n_{B2}}$$
$$= 2\,(z_{1-\alpha/2} - z_{1-\beta_{\min}}) \sqrt{\frac{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{Ia})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{Ib})}{(n_2+n_3)^2}}. \tag{19}$$
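The same kind of check applies to Setup 4: the four-group form of the MDE in Equation (19) should agree with its simplified second line (illustrative numbers; names are ours):

```python
from statistics import NormalDist

n1, n2, n3 = 3000, 2000, 1000
mu_i1, mu_c1, mu_i2, mu_c2, mu_c3, mu_ia, mu_ib = 1.2, 1.1, 1.3, 1.15, 1.05, 1.25, 1.35
v_i1, v_c1, v_i2, v_c2, v_c3, v_ia, v_ib = 0.6, 0.55, 0.65, 0.5, 0.45, 0.7, 0.75
z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

# Equation (18): difference of differences.
delta_s4 = ((n2 * (mu_i2 - mu_c2) + n3 * (mu_ib - mu_c3)) / (n2 + n3)
            - (n1 * (mu_i1 - mu_c1) + n3 * (mu_ia - mu_c3)) / (n1 + n3))

# Equation (19), four-group form ...
var_a1 = (n1 * v_c1 + n3 * v_c3) / (n1 + n3)
var_a2 = (n1 * v_i1 + n3 * v_ia) / (n1 + n3)
var_b1 = (n2 * v_c2 + n3 * v_c3) / (n2 + n3)
var_b2 = (n2 * v_i2 + n3 * v_ib) / (n2 + n3)
na, nb = (n1 + n3) / 4, (n2 + n3) / 4
theta_s4 = z * (var_a1 / na + var_a2 / na + var_b1 / nb + var_b2 / nb) ** 0.5

# ... and its simplified form on the second line of (19).
theta_s4_simplified = 2 * z * (
    (n1 * (v_c1 + v_i1) + n3 * (v_c3 + v_ia)) / (n1 + n3) ** 2
    + (n2 * (v_c2 + v_i2) + n3 * (v_c3 + v_ib)) / (n2 + n3) ** 2) ** 0.5
print(abs(theta_s4 - theta_s4_simplified) < 1e-9)
```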
B METRIC DILUTION
We then expand the calculations in Section 3.2 of the paper, where we discuss the conditions under which an experiment setup with metric dilution (Setup 2) will emerge superior to one without metric dilution (Setup 3), and vice versa.

B.1 The first criterion
We first show that $\theta^*_{S3} < \theta^*_{S2}$, the condition which will lead to Setup 3 being superior to Setup 2 under the first criterion, is equivalent to
$$\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})\big) \cdot \frac{n_0 + 2n_1 + 2n_2 + 2n_3}{2(n_1+n_2+n_3)^2} < \sigma^2_{C0}. \tag{20}$$
We start by substituting the expressions for $\theta^*_{S2}$ (Equation (12)) and $\theta^*_{S3}$ (Equation (15)) into the inequality $\theta^*_{S3} < \theta^*_{S2}$ to obtain
$$(z_{1-\alpha/2} - z_{1-\beta_{\min}}) \sqrt{\frac{2\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})\big)}{(n_1+n_2+n_3)^2}} < (z_{1-\alpha/2} - z_{1-\beta_{\min}}) \sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})\big)}{(n_0+n_1+n_2+n_3)^2}}. \tag{20a}$$
Canceling the $z_{1-\alpha/2} - z_{1-\beta_{\min}}$ and $\sqrt{2}$ terms on both sides, and then squaring, yields
$$\frac{n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})}{(n_1+n_2+n_3)^2} < \frac{n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})}{(n_0+n_1+n_2+n_3)^2}. \tag{20b}$$
We then write $\xi = n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})$ and move the $\xi$ term on the RHS to the LHS:
$$\xi \left( \frac{1}{(n_1+n_2+n_3)^2} - \frac{1}{(n_0+n_1+n_2+n_3)^2} \right) < \frac{n_0(2\sigma^2_{C0})}{(n_0+n_1+n_2+n_3)^2}. \tag{20c}$$
As the partial fractions can be consolidated as
$$\frac{1}{(n_1+n_2+n_3)^2} - \frac{1}{(n_0+n_1+n_2+n_3)^2} = \frac{(n_0+n_1+n_2+n_3)^2 - (n_1+n_2+n_3)^2}{(n_1+n_2+n_3)^2 (n_0+n_1+n_2+n_3)^2} = \frac{(n_0 + 2n_1 + 2n_2 + 2n_3)\, n_0}{(n_1+n_2+n_3)^2 (n_0+n_1+n_2+n_3)^2}, \tag{20d}$$
where the second step uses the identity $a^2 - b^2 = (a+b)(a-b)$, Inequality (20c) can be written as
$$\xi \left( \frac{(n_0 + 2n_1 + 2n_2 + 2n_3)\, n_0}{(n_1+n_2+n_3)^2 (n_0+n_1+n_2+n_3)^2} \right) < \frac{n_0(2\sigma^2_{C0})}{(n_0+n_1+n_2+n_3)^2}. \tag{20e}$$
We finally cancel the $n_0$ and $(n_0+n_1+n_2+n_3)^2$ terms on both sides of Inequality (20e), move the factor of two to the LHS, and write $\xi$ in its full form to arrive at Inequality (20):
$$\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})\big) \cdot \frac{n_0 + 2n_1 + 2n_2 + 2n_3}{2(n_1+n_2+n_3)^2} < \sigma^2_{C0}.$$
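Inequality (20) can be verified numerically against the raw comparison $\theta^*_{S3} < \theta^*_{S2}$ it was derived from, over randomly drawn parameters (a sketch; names are ours):

```python
import random
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

def criterion_matches(trials=1000, seed=0):
    """Check Inequality (20) agrees with theta*_S3 < theta*_S2 on random draws."""
    rnd = random.Random(seed)
    for _ in range(trials):
        n0, n1, n2, n3 = (rnd.randint(100, 10_000) for _ in range(4))
        v_c0, v_i1, v_c1, v_c2, v_i2, v_ia, v_ib = (
            rnd.uniform(0.1, 2.0) for _ in range(7))
        m, big_n = n1 + n2 + n3, n0 + n1 + n2 + n3
        xi = n1 * (v_i1 + v_c1) + n2 * (v_c2 + v_i2) + n3 * (v_ia + v_ib)
        theta_s3 = Z * (2 * xi / m ** 2) ** 0.5                         # Eq. (15)
        theta_s2 = Z * (2 * (n0 * 2 * v_c0 + xi) / big_n ** 2) ** 0.5   # Eq. (12)
        ineq_20 = xi * (n0 + 2 * m) / (2 * m ** 2) < v_c0
        if ineq_20 != (theta_s3 < theta_s2):
            return False
    return True

print(criterion_matches())
```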
B.2 The second criterion
In the case where Inequality (20) does not hold, we consider when Setup 3 will emerge superior to Setup 2 under the second criterion:
$$\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}.$$
If this inequality does not hold either (and the two sides are not equal), we consider Setup 2 superior to Setup 3 under the same criterion, as the following holds:
$$\Delta_{S3} - \Delta_{S2} < \theta^*_{S3} - \theta^*_{S2} \iff \Delta_{S2} - \Delta_{S3} > \theta^*_{S2} - \theta^*_{S3}.$$

The master inequality. We first show that the inequality $\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}$ is equivalent to
$$\frac{n_1+n_2+n_3}{n_0} \sqrt{2 n_0 \sigma^2_{C0} + \xi} > \frac{n_0+n_1+n_2+n_3}{n_0} \sqrt{\xi} - \frac{\eta}{\sqrt{2}\, z}, \tag{22}$$
where
$$\eta = n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{Ia}),$$
$$\xi = n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{I2}+\sigma^2_{C2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib}), \text{ and}$$
$$z = z_{1-\alpha/2} - z_{1-\beta_{\min}}.$$
Writing $\eta$, $\xi$, and $z$ as shown above, we substitute the expressions for $\Delta_{S2}$, $\theta^*_{S2}$, $\Delta_{S3}$, and $\theta^*_{S3}$ (Equations (11), (12), (14), and (15) respectively) into the initial inequality to obtain
$$\frac{\eta}{n_1+n_2+n_3} - \frac{\eta}{n_0+n_1+n_2+n_3} > z \sqrt{\frac{2\xi}{(n_1+n_2+n_3)^2}} - z \sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + \xi\big)}{(n_0+n_1+n_2+n_3)^2}}. \tag{22a}$$
Pulling out the common factors on each side we have
$$\eta \left( \frac{1}{n_1+n_2+n_3} - \frac{1}{n_0+n_1+n_2+n_3} \right) > \sqrt{2}\, z \left( \frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2 n_0 \sigma^2_{C0} + \xi}}{n_0+n_1+n_2+n_3} \right). \tag{22b}$$
Writing the partial fraction on the LHS of Inequality (22b) as a composite fraction we have
$$\eta \left( \frac{n_0}{(n_1+n_2+n_3)(n_0+n_1+n_2+n_3)} \right) > \sqrt{2}\, z \left( \frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2 n_0 \sigma^2_{C0} + \xi}}{n_0+n_1+n_2+n_3} \right). \tag{22c}$$
We then move the composite fraction to the RHS and the $\sqrt{2}\, z$ term to the LHS:
$$\frac{\eta}{\sqrt{2}\, z} > \frac{(n_1+n_2+n_3)(n_0+n_1+n_2+n_3)}{n_0} \left( \frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2 n_0 \sigma^2_{C0} + \xi}}{n_0+n_1+n_2+n_3} \right), \tag{22d}$$
and expand the brackets, canceling terms that appear on both sides of the fractions on the RHS:
$$\frac{\eta}{\sqrt{2}\, z} > \frac{n_0+n_1+n_2+n_3}{n_0} \sqrt{\xi} - \frac{n_1+n_2+n_3}{n_0} \sqrt{2 n_0 \sigma^2_{C0} + \xi}. \tag{22e}$$
Finally, we swap the position of the leftmost term with that of the rightmost term in Inequality (22e) to arrive at Inequality (22):
$$\frac{n_1+n_2+n_3}{n_0} \sqrt{2 n_0 \sigma^2_{C0} + \xi} > \frac{n_0+n_1+n_2+n_3}{n_0} \sqrt{\xi} - \frac{\eta}{\sqrt{2}\, z}.$$
The trivial case: RHS ≤ 0. We first observe that the LHS of Inequality (22) is always positive, and hence the inequality trivially holds if the RHS is non-positive. Here we show that RHS ≤ 0 is equivalent to
$$\frac{n_0+n_1+n_2+n_3}{n_0}\, \theta^*_{S3} \leq \Delta_{S3}. \tag{23}$$
This can be done by writing RHS ≤ 0 in full:
$$\frac{n_0+n_1+n_2+n_3}{n_0} \sqrt{\xi} - \frac{\eta}{\sqrt{2}\, z} \leq 0, \tag{23a}$$
and moving the second term on the LHS to the RHS:
$$\frac{n_0+n_1+n_2+n_3}{n_0} \sqrt{\xi} \leq \frac{\eta}{\sqrt{2}\, z}. \tag{23b}$$
We then multiply both sides by a factor of $\sqrt{2}\, z / (n_1+n_2+n_3)$ to get
$$\frac{n_0+n_1+n_2+n_3}{n_0} \cdot \frac{\sqrt{2}\, z \sqrt{\xi}}{n_1+n_2+n_3} \leq \frac{\eta}{n_1+n_2+n_3}. \tag{23c}$$
Noting from Equations (14) and (15) that
$$\Delta_{S3} = \frac{\eta}{n_1+n_2+n_3} \quad \text{and} \quad \theta^*_{S3} = \frac{\sqrt{2}\, z\, \sqrt{\xi}}{n_1+n_2+n_3},$$
we finally replace the terms in Inequality (23c) with $\Delta_{S3}$ and $\theta^*_{S3}$ to arrive at Inequality (23):
$$\frac{n_0+n_1+n_2+n_3}{n_0}\, \theta^*_{S3} \leq \Delta_{S3}.$$
The non-trivial case: RHS > 0. We then tackle the case where the RHS of the master inequality (Inequality (22)) is greater than zero. We show that in this non-trivial case, Inequality (22) is equivalent to
$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left( \theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\, \theta^*_{S3} \right)^2 - \left( \frac{n_1+n_2+n_3}{n_0}\, \theta^*_{S3} \right)^2}{2 z^2}. \tag{24}$$
We first multiply both sides of Inequality (22) by the fraction $n_0 \sqrt{2}\, z / (n_1+n_2+n_3)$ to get
$$\frac{n_1+n_2+n_3}{n_0} \sqrt{2 n_0 \sigma^2_{C0} + \xi} \cdot \frac{n_0 \sqrt{2}\, z}{n_1+n_2+n_3} > \left( \frac{n_0+n_1+n_2+n_3}{n_0} \sqrt{\xi} - \frac{\eta}{\sqrt{2}\, z} \right) \frac{n_0 \sqrt{2}\, z}{n_1+n_2+n_3}. \tag{24a}$$
Canceling terms on both sides of the fractions we have
$$\sqrt{2 n_0 \sigma^2_{C0} + \xi}\; \sqrt{2}\, z > (n_0+n_1+n_2+n_3)\, \frac{\sqrt{2}\, z \sqrt{\xi}}{n_1+n_2+n_3} - \frac{n_0\, \eta}{n_1+n_2+n_3}. \tag{24b}$$
Again noting the identities for $\Delta_{S3}$ and $\theta^*_{S3}$ stated above, we can replace the fractions on the RHS and obtain
$$\sqrt{2 n_0 \sigma^2_{C0} + \xi}\; \sqrt{2}\, z > (n_0+n_1+n_2+n_3)\, \theta^*_{S3} - n_0\, \Delta_{S3}. \tag{24c}$$
We then square both sides of Inequality (24c) and move the $2z^2$ term to the RHS:
$$2 n_0 \sigma^2_{C0} + \xi > \frac{\big( (n_0+n_1+n_2+n_3)\, \theta^*_{S3} - n_0\, \Delta_{S3} \big)^2}{2 z^2}. \tag{24d}$$
Note the squaring still allows the implication to go both ways, as both sides of Inequality (24c) are positive. Based on the identity for $\theta^*_{S3}$, we observe $\xi$ can also be written as
$$\xi = \frac{(n_1+n_2+n_3)^2\, (\theta^*_{S3})^2}{2 z^2}. \tag{24e}$$
Thus, we can group all terms with a $2z^2$ denominator by moving $\xi$ in Inequality (24d) to the RHS and substituting Equation (24e) into the resultant inequality:
$$2 n_0 \sigma^2_{C0} > \frac{\big( (n_0+n_1+n_2+n_3)\, \theta^*_{S3} - n_0\, \Delta_{S3} \big)^2 - \big( (n_1+n_2+n_3)\, \theta^*_{S3} \big)^2}{2 z^2}. \tag{24f}$$
We finally normalize the inequality to one with unit coefficients on $\Delta_{S3}$ and $\theta^*_{S3}$ to enable effective comparison. We divide both sides of Inequality (24f) by $n_0^2$:
$$\frac{2 \sigma^2_{C0}}{n_0} > \frac{\left( \frac{n_0+n_1+n_2+n_3}{n_0}\, \theta^*_{S3} - \Delta_{S3} \right)^2 - \left( \frac{n_1+n_2+n_3}{n_0}\, \theta^*_{S3} \right)^2}{2 z^2}, \tag{24g}$$
and split the coefficient of $\theta^*_{S3}$ in the first squared term into an integer (1) and a fractional ($(n_1+n_2+n_3)/n_0$) part to arrive at Inequality (24):
$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left( \theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\, \theta^*_{S3} \right)^2 - \left( \frac{n_1+n_2+n_3}{n_0}\, \theta^*_{S3} \right)^2}{2 z^2}.$$
C DUAL CONTROL
We finally clarify the calculations in Section 3.3 of the paper, where we determine the sample size required for Setup 4 (a.k.a. a dual control setup) to emerge superior to Setup 3 (a simpler A/B test setup). In the paper we showed that $\theta^*_{S4} > \theta^*_{S3}$ always holds, and hence Setup 4 will never be superior to Setup 3 under the first evaluation criterion. We thus focus on the second evaluation criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$, and first show that the criterion is equivalent to Inequality (27). Assuming the ratio of user group sizes $n_1 : n_2 : n_3$ remains unchanged, we then show that
(i) the RHS of Inequality (27) remains a constant, and
(ii) the LHS of Inequality (27) scales as $O(\sqrt{n})$, where $n$ is the number of users.
The results mean Setup 4 could emerge superior to Setup 3 if we have a sufficiently large number of users. We also show from the inequality that
(iii) the number of users required for a dual control setup to emerge superior to simpler setups is accessible only to the largest organizations and their top affiliates.
The master inequality. We first show the criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$ is equivalent to
$$\frac{\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})\big)}{\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})}}$$
$$> \sqrt{2}\, z \left[ \sqrt{ \frac{2\left( \big(1+\frac{n_2}{n_1+n_3}\big)^2 \big[n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{Ia})\big] + \big(1+\frac{n_1}{n_2+n_3}\big)^2 \big[n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{Ib})\big] \right)}{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})} } - 1 \right], \tag{27}$$
where $z = z_{1-\alpha/2} - z_{1-\beta_{\min}}$. The number of terms involved is large, and hence we first simplify the LHS and RHS independently, and combine them in the final step.
For the LHS (i.e. $\Delta_{S4} - \Delta_{S3}$), we substitute in Equations (14) and (18) to obtain
$$\frac{n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})}{n_2+n_3} - \frac{n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})}{n_1+n_3} - \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{Ia})}{n_1+n_2+n_3}. \tag{27a}$$
The expression can be rewritten in terms of multiplicative products between the $n$-terms and the (differences between) $\mu$-terms:
$$n_1(\mu_{I1}-\mu_{C1})\Big[{-\tfrac{1}{n_1+n_3}} + \tfrac{1}{n_1+n_2+n_3}\Big] + n_2(\mu_{I2}-\mu_{C2})\Big[\tfrac{1}{n_2+n_3} - \tfrac{1}{n_1+n_2+n_3}\Big] + n_3\mu_{Ib}\Big[\tfrac{1}{n_2+n_3} - \tfrac{1}{n_1+n_2+n_3}\Big] + n_3\mu_{Ia}\Big[{-\tfrac{1}{n_1+n_3}} + \tfrac{1}{n_1+n_2+n_3}\Big] + n_3\mu_{C3}\Big[{-\tfrac{1}{n_2+n_3}} + \tfrac{1}{n_1+n_3}\Big]. \tag{27b}$$
We then extract a $1/(n_1+n_2+n_3)$ term from Expression (27b):
$$(n_1+n_2+n_3)^{-1} \Big[ n_1(\mu_{I1}-\mu_{C1})\big({-\tfrac{n_1+n_2+n_3}{n_1+n_3}} + 1\big) + n_2(\mu_{I2}-\mu_{C2})\big(\tfrac{n_1+n_2+n_3}{n_2+n_3} - 1\big) + n_3\mu_{Ib}\big(\tfrac{n_1+n_2+n_3}{n_2+n_3} - 1\big) + n_3\mu_{Ia}\big({-\tfrac{n_1+n_2+n_3}{n_1+n_3}} + 1\big) + n_3\mu_{C3}\big({-\tfrac{n_1+n_2+n_3}{n_2+n_3}} + \tfrac{n_1+n_2+n_3}{n_1+n_3}\big) \Big]. \tag{27c}$$
This allows us to perform some cancellation with the RHS, which also has a $1/(n_1+n_2+n_3)$ term, in the final step. Noting
$$\frac{n_1+n_2+n_3}{n_1+n_3} = 1 + \frac{n_2}{n_1+n_3} \quad \text{and} \quad \frac{n_1+n_2+n_3}{n_2+n_3} = 1 + \frac{n_1}{n_2+n_3},$$
we can write the inner square brackets as
$$(n_1+n_2+n_3)^{-1} \Big[ n_1(\mu_{I1}-\mu_{C1})\big({-\tfrac{n_2}{n_1+n_3}}\big) + n_2(\mu_{I2}-\mu_{C2})\big(\tfrac{n_1}{n_2+n_3}\big) + n_3\mu_{Ib}\big(\tfrac{n_1}{n_2+n_3}\big) + n_3\mu_{Ia}\big({-\tfrac{n_2}{n_1+n_3}}\big) + n_3\mu_{C3}\big({-1} - \tfrac{n_1}{n_2+n_3} + 1 + \tfrac{n_2}{n_1+n_3}\big) \Big], \tag{27d}$$
and group the $n_1/(n_2+n_3)$ and $n_2/(n_1+n_3)$ terms to arrive at
$$\frac{1}{n_1+n_2+n_3} \left[ \frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})\big) \right]. \tag{27e}$$
For the RHS (i.e. $\theta^*_{S4} - \theta^*_{S3}$), we substitute in Equations (15) and (19) to obtain
$$2z \sqrt{ \frac{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{Ia})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{Ib})}{(n_2+n_3)^2} } - \sqrt{2}\, z \sqrt{ \frac{n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})}{(n_1+n_2+n_3)^2} }, \tag{27f}$$
where $z = z_{1-\alpha/2} - z_{1-\beta_{\min}}$. We then extract a $\sqrt{2}\, z / (n_1+n_2+n_3)$ term from Expression (27f) to arrive at
$$\frac{\sqrt{2}\, z}{n_1+n_2+n_3} \left[ \sqrt{2} \sqrt{ \left(\tfrac{n_1+n_2+n_3}{n_1+n_3}\right)^2 \big[n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{Ia})\big] + \left(\tfrac{n_1+n_2+n_3}{n_2+n_3}\right)^2 \big[n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{Ib})\big] } - \sqrt{ n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib}) } \right], \tag{27g}$$
where $(n_1+n_2+n_3)/(n_2+n_3)$ and $(n_1+n_2+n_3)/(n_1+n_3)$ can also be written as $1 + n_1/(n_2+n_3)$ and $1 + n_2/(n_1+n_3)$ respectively.
We finally combine both sides of the inequality by taking Expressions (27e) and (27g):
$$\frac{1}{n_1+n_2+n_3} \left[ \frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})\big) \right]$$
$$> \frac{\sqrt{2}\, z}{n_1+n_2+n_3} \left[ \sqrt{2} \sqrt{ \big(1+\tfrac{n_2}{n_1+n_3}\big)^2 \big[n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{Ia})\big] + \big(1+\tfrac{n_1}{n_2+n_3}\big)^2 \big[n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{Ib})\big] } - \sqrt{ n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib}) } \right]. \tag{27h}$$
Canceling the common $1/(n_1+n_2+n_3)$ terms on both sides, and dividing both sides by $\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})}$, leads us to Inequality (27):
$$\frac{\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})\big)}{\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})}}$$
$$> \sqrt{2}\, z \left[ \sqrt{ \frac{2\left( \big(1+\frac{n_2}{n_1+n_3}\big)^2 \big[n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{Ia})\big] + \big(1+\frac{n_1}{n_2+n_3}\big)^2 \big[n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{Ib})\big] \right)}{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{Ia}+\sigma^2_{Ib})} } - 1 \right].$$
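Since both sides of Inequality (27) are the corresponding sides of the raw criterion scaled by the same positive factor $(n_1+n_2+n_3)/\sqrt{\xi}$, the equivalence can also be checked numerically (a sketch; names are ours):

```python
import random
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

def ineq_27_matches(trials=1000, seed=0):
    rnd = random.Random(seed)
    for _ in range(trials):
        n1, n2, n3 = (rnd.randint(100, 10_000) for _ in range(3))
        mu = {k: rnd.uniform(0.5, 1.5) for k in
              ("c1", "i1", "c2", "i2", "c3", "ia", "ib")}
        v = {k: rnd.uniform(0.1, 2.0) for k in
             ("c1", "i1", "c2", "i2", "c3", "ia", "ib")}
        m = n1 + n2 + n3
        xi = (n1 * (v["c1"] + v["i1"]) + n2 * (v["c2"] + v["i2"])
              + n3 * (v["ia"] + v["ib"]))
        a = n1 * (v["c1"] + v["i1"]) + n3 * (v["c3"] + v["ia"])
        b = n2 * (v["c2"] + v["i2"]) + n3 * (v["c3"] + v["ib"])
        # Raw criterion: Delta_S4 - Delta_S3 > theta*_S4 - theta*_S3.
        delta_s4 = ((n2 * (mu["i2"] - mu["c2"]) + n3 * (mu["ib"] - mu["c3"])) / (n2 + n3)
                    - (n1 * (mu["i1"] - mu["c1"]) + n3 * (mu["ia"] - mu["c3"])) / (n1 + n3))
        delta_s3 = (n1 * (mu["c1"] - mu["i1"]) + n2 * (mu["i2"] - mu["c2"])
                    + n3 * (mu["ib"] - mu["ia"])) / m
        theta_s4 = 2 * Z * (a / (n1 + n3) ** 2 + b / (n2 + n3) ** 2) ** 0.5
        theta_s3 = Z * (2 * xi) ** 0.5 / m
        criterion = delta_s4 - delta_s3 > theta_s4 - theta_s3
        # Inequality (27).
        lhs = (delta_s4 - delta_s3) * m / xi ** 0.5
        rhs = 2 ** 0.5 * Z * ((2 * ((1 + n2 / (n1 + n3)) ** 2 * a
                                    + (1 + n1 / (n2 + n3)) ** 2 * b) / xi) ** 0.5 - 1)
        if (lhs > rhs) != criterion:
            return False
    return True

print(ineq_27_matches())
```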
RHS remains a constant. We then simplify the $\sigma^2$-terms in Inequality (27) by assuming that they are similar in magnitude, i.e.
$$\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{Ib} \approx \sigma^2_r,$$
and show the RHS of the inequality is equal to
$$\sqrt{2}\, z \left[ \sqrt{ 2\left( \frac{n_1+n_2+n_3}{n_1+n_3} + \frac{n_1+n_2+n_3}{n_2+n_3} \right) } - 1 \right]. \tag{28}$$
As long as the group size ratio $n_1 : n_2 : n_3$ remains unchanged, Expression (28) remains a constant. It is safe to apply the simplifying assumption, as we know from the evaluation framework specification that there are three classes of parameters in the inequality: the user group sizes ($n$), the mean responses ($\mu$), and the response variances ($\sigma^2$). Among these three classes, only the user group sizes have the potential to scale in any practical setting, and thus we can effectively treat the means and variances as constants below.
We begin by substituting $\sigma^2_r$ into Inequality (27):
$$\frac{\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})\big)}{\sqrt{n_1(2\sigma^2_r) + n_2(2\sigma^2_r) + n_3(2\sigma^2_r)}}$$
$$> \sqrt{2}\, z \left[ \sqrt{ \frac{2\left( \big(1+\frac{n_2}{n_1+n_3}\big)^2 \big[n_1(2\sigma^2_r) + n_3(2\sigma^2_r)\big] + \big(1+\frac{n_1}{n_2+n_3}\big)^2 \big[n_2(2\sigma^2_r) + n_3(2\sigma^2_r)\big] \right)}{n_1(2\sigma^2_r) + n_2(2\sigma^2_r) + n_3(2\sigma^2_r)} } - 1 \right]. \tag{28a}$$
Moving the common $2\sigma^2_r$ terms out and canceling the common terms in the RHS fraction we have
$$\frac{\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{Ib}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{Ia}-\mu_{C3})\big)}{\sqrt{2\sigma^2_r\, (n_1+n_2+n_3)}}$$
$$> \sqrt{2}\, z \left[ \sqrt{ 2 \cdot \frac{\big(1+\frac{n_2}{n_1+n_3}\big)^2 (n_1+n_3) + \big(1+\frac{n_1}{n_2+n_3}\big)^2 (n_2+n_3)}{n_1+n_2+n_3} } - 1 \right]. \tag{28b}$$
We can already see the LHS of Inequality (28b) scales as $O(\sqrt{n})$; we will demonstrate this result in greater detail below. Focusing on the RHS of the inequality, we express the squared terms as rational fractions and divide each term in the numerator by the denominator to obtain
$$\sqrt{2}\, z \left[ \sqrt{ 2 \left( \left(\frac{n_1+n_2+n_3}{n_1+n_3}\right)^2 \frac{n_1+n_3}{n_1+n_2+n_3} + \left(\frac{n_1+n_2+n_3}{n_2+n_3}\right)^2 \frac{n_2+n_3}{n_1+n_2+n_3} \right) } - 1 \right]. \tag{28c}$$
Canceling the common $n_1+n_2+n_3$ terms leads to Expression (28):
$$\sqrt{2}\, z \left[ \sqrt{ 2\left( \frac{n_1+n_2+n_3}{n_1+n_3} + \frac{n_1+n_2+n_3}{n_2+n_3} \right) } - 1 \right].$$
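Under the equal-variance assumption, the RHS of Inequality (27) should thus collapse to Expression (28) for any group sizes, and to $\sqrt{2}\, z (\sqrt{6}-1)$ when the three groups are equal. A quick numerical check (helper names are ours):

```python
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

def rhs_27_equal_var(n1, n2, n3, var_r=1.0):
    """RHS of Inequality (27) with all sigma^2 terms set to var_r."""
    a = (n1 + n3) * 2 * var_r
    b = (n2 + n3) * 2 * var_r
    xi = (n1 + n2 + n3) * 2 * var_r
    inner = 2 * ((1 + n2 / (n1 + n3)) ** 2 * a
                 + (1 + n1 / (n2 + n3)) ** 2 * b) / xi
    return 2 ** 0.5 * Z * (inner ** 0.5 - 1)

def expr_28(n1, n2, n3):
    """Expression (28): depends only on the group size ratio."""
    m = n1 + n2 + n3
    return 2 ** 0.5 * Z * ((2 * (m / (n1 + n3) + m / (n2 + n3))) ** 0.5 - 1)

for sizes in [(1000, 2000, 3000), (500, 500, 500), (123, 456, 789)]:
    assert abs(rhs_27_equal_var(*sizes) - expr_28(*sizes)) < 1e-9
# With equal group sizes, the inner square root evaluates to sqrt(6).
print(abs(expr_28(7, 7, 7) - 2 ** 0.5 * Z * (6 ** 0.5 - 1)) < 1e-9)
```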
LHS scales as $O(\sqrt{n})$. We demonstrate the scaling relation between the LHS of Inequality (27) and the number of users in each group by simplifying the $n$-terms (but not the $\sigma^2$-terms as above), assuming $n_1 \approx n_2 \approx n_3 \approx n$, and showing the LHS of the inequality is equal to
$$\frac{\sqrt{n}\, \big( (\mu_{I2}-\mu_{C2}) - (\mu_{I1}-\mu_{C1}) + \mu_{Ib} - \mu_{Ia} \big)}{2 \sqrt{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{Ia} + \sigma^2_{Ib}}}. \tag{29}$$
While the relationship (that the LHS of the inequality scales as $O(\sqrt{n})$) is evident by inspecting the LHS of Inequality (28b), or even Inequality (27) itself, we believe the simplification shows the relationship more clearly.
We begin by substituting $n$ into Inequality (27) to obtain
$$\frac{\frac{n}{n+n}\big(n(\mu_{I2}-\mu_{C2}) + n(\mu_{Ib}-\mu_{C3})\big) - \frac{n}{n+n}\big(n(\mu_{I1}-\mu_{C1}) + n(\mu_{Ia}-\mu_{C3})\big)}{\sqrt{n(\sigma^2_{C1}+\sigma^2_{I1}) + n(\sigma^2_{C2}+\sigma^2_{I2}) + n(\sigma^2_{Ia}+\sigma^2_{Ib})}}$$
$$> \sqrt{2}\, z \left[ \sqrt{ \frac{2\left( \big(1+\frac{n}{n+n}\big)^2 \big[n(\sigma^2_{C1}+\sigma^2_{I1}) + n(\sigma^2_{C3}+\sigma^2_{Ia})\big] + \big(1+\frac{n}{n+n}\big)^2 \big[n(\sigma^2_{C2}+\sigma^2_{I2}) + n(\sigma^2_{C3}+\sigma^2_{Ib})\big] \right)}{n(\sigma^2_{C1}+\sigma^2_{I1}) + n(\sigma^2_{C2}+\sigma^2_{I2}) + n(\sigma^2_{Ia}+\sigma^2_{Ib})} } - 1 \right]. \tag{29a}$$
Moving the common $n$-terms out and canceling them in the fractions where appropriate leads to
$$\frac{\sqrt{n}\, \frac{1}{2} \Big[ \big((\mu_{I2}-\mu_{C2}) + (\mu_{Ib}-\mu_{C3})\big) - \big((\mu_{I1}-\mu_{C1}) + (\mu_{Ia}-\mu_{C3})\big) \Big]}{\sqrt{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{Ia} + \sigma^2_{Ib}}}$$
$$> \sqrt{2}\, z \left[ \sqrt{ \frac{2\left( \big(1+\frac{1}{2}\big)^2 (\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C3}+\sigma^2_{Ia}) + \big(1+\frac{1}{2}\big)^2 (\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{C3}+\sigma^2_{Ib}) \right)}{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{Ia}+\sigma^2_{Ib}} } - 1 \right], \tag{29b}$$
where the LHS is equal to Expression (29) as claimed above. It is clear that there are no $n$-terms left on the RHS of Inequality (29b), and hence the RHS remains a constant as shown previously. Setting up the inequality to demonstrate the third result (that the number of users required for a dual control setup to emerge superior is large), we further simplify the RHS of the inequality by
rearranging the terms in the square root:
$$\frac{\sqrt{n}\, \Big[ \big((\mu_{I2}-\mu_{C2}) + (\mu_{Ib}-\mu_{C3})\big) - \big((\mu_{I1}-\mu_{C1}) + (\mu_{Ia}-\mu_{C3})\big) \Big]}{2 \sqrt{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{Ia} + \sigma^2_{Ib}}}$$
$$> \sqrt{2}\, z \left[ \sqrt{ 2 \left(\frac{3}{2}\right)^2 \left( 1 + \frac{2\sigma^2_{C3}}{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{Ia} + \sigma^2_{Ib}} \right) } - 1 \right]. \tag{29c}$$
Required number of users is large. We finally show that while Setup 4 could emerge superior to Setup 3 as the number of users increases, the number of users required is high. We do so by assuming both the $\sigma^2$- and $n$-terms are similar in magnitude, i.e. $\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{Ib} \approx \sigma^2_r$ and $n_1 \approx n_2 \approx n_3 \approx n$, and show that Inequality (27) is equivalent to
$$n > \left( 2\sqrt{12}\, \big(\sqrt{6} - 1\big)\, z \right)^2 \frac{\sigma^2_r}{\Delta^2}, \tag{30}$$
where $\Delta = (\mu_{I2}-\mu_{C2}) - (\mu_{I1}-\mu_{C1}) + \mu_{Ib} - \mu_{Ia}$ is the difference in actual effect size between Setups 4 and 3. Note we are determining when Setup 4 is superior to Setup 3 under the second evaluation criterion (that the gain in actual effect is greater than the loss in sensitivity), and thus assume $\Delta$ is positive.
The equivalence can be shown by substituting $\sigma^2_r$ into Inequality (29c), which already assumes the $n$-terms are similar in magnitude:⁸
$$\frac{\sqrt{n}\, \Big[ \big((\mu_{I2}-\mu_{C2}) + (\mu_{Ib}-\mu_{C3})\big) - \big((\mu_{I1}-\mu_{C1}) + (\mu_{Ia}-\mu_{C3})\big) \Big]}{2\sqrt{6\sigma^2_r}} > \sqrt{2}\, z \left[ \sqrt{ 2\left(\frac{3}{2}\right)^2 \left(1 + \frac{2\sigma^2_r}{6\sigma^2_r}\right) } - 1 \right]. \tag{30a}$$
Noting the expression within the LHS square bracket is equal to $\Delta$, we simplify the expression within the RHS square root, and move every non-$n$ term to the RHS of the inequality to obtain
$$\sqrt{n} > \sqrt{2}\, z \big(\sqrt{6} - 1\big)\, \frac{2\sqrt{6\sigma^2_r}}{\Delta}. \tag{30b}$$
As all quantities in the inequality are positive, we can square both sides and consolidate the coefficients on the RHS to arrive at Inequality (30):
$$n > \left( 2\sqrt{12}\, \big(\sqrt{6} - 1\big)\, z \right)^2 \frac{\sigma^2_r}{\Delta^2}.$$

⁸ Alternatively, we can substitute $n$ into Inequality (28b), which already assumes the $\sigma^2$-terms are similar in magnitude. Simplifying the resultant inequality yields the same end result.