
Optimal design of experiments to identify latent behavioral types

    Stefano Balietti1,2, Brennan Klein3, and Christoph Riedl∗3,4,5,6

1 Center for European Social Science Research, Universität Mannheim, MZES, 68131 Mannheim, Germany
2 Research Center for Environmental Economics, Heidelberg University, 69117 Heidelberg, Germany
3 Network Science Institute, Northeastern University, Boston, MA, USA
4 D’Amore-McKim School of Business, Northeastern University, Boston, MA 02115, USA
5 Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
6 Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA

    August 10, 2020

    Abstract

Bayesian optimal experiments that maximize the information gained from collected data are critical to efficiently identify behavioral models. We extend a seminal method for designing Bayesian optimal experiments by introducing two computational improvements that make the procedure tractable: (1) a search algorithm from artificial intelligence that efficiently explores the space of possible design parameters, and (2) a sampling procedure which evaluates each design parameter combination more efficiently. We apply our procedure to a game of imperfect information to evaluate and quantify the computational improvements. We then collect data across five different experimental designs to compare the ability of the optimal experimental design to discriminate among competing behavioral models against the experimental designs chosen by a “wisdom of experts” prediction experiment. We find that the experiment suggested by the optimal design approach requires significantly less data to distinguish behavioral models (i.e., to test hypotheses) than the experiment suggested by experts. Substantively, we find that reinforcement learning best explains human decision-making in the imperfect information game and that behavior is not adequately described by the Bayesian Nash equilibrium. Our procedure is general and computationally efficient and can be applied to dynamically optimize online experiments.

    1 Introduction

Experimentation in the social sciences is a fundamental tool for understanding the mechanisms and heuristics that underlie human behavior. At the same time, running experiments is a costly process and requires careful design in order to test hypotheses while maximizing statistical power. This experimental design process is often guided by the intuition of the scientists conducting the research. While there are many benefits in relying on the intuition of experienced researchers, there is often a lack of principled guidance when choosing which experiment to run [1, 2]. As a result, experiments may have low power to distinguish between different models of behavior [3] and lead

∗ [email protected]


to increased costs for data collection. At worst, they lead to reduced effect sizes and incorrect rejection or acceptance of a null hypothesis [4].

A growing body of researchers, including social scientists, computer scientists, and industry professionals, has explored ways in which experiment selection can be optimized and how artificial intelligence (AI) can be used to select experiments [5]. For example, researchers now optimize the experiments they run using statistical techniques (e.g., Thompson sampling) to preferentially assign participants to experimental treatments in order to optimize the treatment effect. This allows researchers to easily decide which treatment among many possible treatment arms is most effective in maximizing a certain outcome, such as the response rate in a marketing campaign [6, 7, 8, 9]. Other techniques involve optimizing the length of the experiment, the number of participants needed, or the sequence of questions in behavioral batteries [10, 11, 12, 13]. In each case, optimizing experimental design has proved fruitful.

However, these techniques tend to focus on the optimization of individuals’ decisions (e.g., finding the best sequence of survey questions for estimating an individual’s risk preferences). These approaches are ill-suited for situations involving strategic interactions, information asymmetries, or network effects [14, 15, 16]. In these contexts, agents with heterogeneous preferences derive different utilities for outcomes that depend on the actions of other agents [17, 18]. For example, for goods with network effects, the utility of one consumer depends not only on that consumer’s preference for the product but also on how many others adopt the good [19, 20]. Attempting to optimize experiments with such strategic complements thus requires a different approach.

We focus on a different set of optimal experimental design procedures that optimize interactive experiments mapping human decision-making to behavioral types, i.e., mathematical models that predict human behavior under specific sociological, economic, and psychological hypotheses [21]. By building and testing various models about how individuals of a specific behavioral type make strategic decisions, modelers and experimenters can make fine-tuned predictions about the likely behavior of participants in experimental settings. As a result, we are better positioned to anticipate responses to societal, market, or technological changes [22].

In this paper, we introduce an AI-based optimal design procedure that maximizes the information gained from running a behavioral experiment. Our procedure recommends which experiments to run (i.e., which parameter values to use for data collection) in order to maximally separate the predictions of multiple competing models of human behavior. The output of this procedure takes the form of a coordinate—a point in the space of all possible experiments—that corresponds to optimal experimental parameters. By way of example, consider an experiment that measures gambling tendency and is parameterized by a maximum payout (M) and a payout multiplier (m). The experimenter’s question is which values to choose for M and m for data collection. Our protocol would output a particular experimental design in the form of a coordinate, (M, m), for example, M=$4.25 and m=1.2 (as opposed to M=$4.15 and m=1.3). While knowing the optimal experimental design is valuable for collecting informative data, finding the optimal design is a computationally intensive task. The approach generally requires evaluating complex models, such as whether a set of observations corresponds to a Nash equilibrium. We propose two methodological innovations that build on seminal methods in Bayesian optimal design [23]. The first addresses which parameter combinations are evaluated; the second addresses how the information content of a given parameter combination is evaluated. Together, both improvements reduce the computational costs of designing optimal experiments by several orders of magnitude while maintaining accuracy. We give a brief intuition behind our two improvements.


[Figure 1: two panels, “Model Fitting: Data from Experiment A” (low information gain) and “Model Fitting: Data from Experiment B” (high information gain), each showing the model likelihood (0.00–1.00) assigned to the data by Models 1, 2, and 3.]

Figure 1: Illustration of low- vs. high-information experiments. Models 1, 2, and 3 assign likelihoods to the data generated in two hypothetical experiments: A and B. Experiment A does not allow researchers to say with certainty which model fits the data best; in other words, the design of Experiment A generates data that contains “Low Information.” On the other hand, Experiment B has a clear winner: Model 1 predicted the observed data as highly likely, while Models 2 and 3 did not. Therefore, the data generated by Experiment B contains “High Information.” The optimal experimental design procedure outlined in this paper returns values for experimental parameters that are expected to maximally distinguish the likelihoods of the competing models, as in Experiment B.

First, we use an adaptive search algorithm based on a Gaussian Process to efficiently search the space of all possible experimental parameter combinations [24]. Absent such an improvement, the optimal design procedure has to follow a brute-force approach that evaluates all possible experimental designs. The intuition behind this improvement is that the search algorithm adaptively explores those experimental designs that are likely very good (designs with high information content) more thoroughly, while reducing the exploration of parameter combinations that are similar to those already known to be bad (low information content).

Second, we develop a sampling technique that evaluates the informativeness of a given experimental design on the basis of simulated likely datasets instead of all observable datasets. This allows us to reduce the size of datasets while simultaneously increasing the relevance of data points in the dataset and thus retaining accuracy despite the smaller size.

There are six main parts in this article. First, we describe how to optimize experimental design for information gain, and we lay out our methodological improvements to this protocol. Second, we evaluate and quantify the extent of these computational improvements. Third, we use our improved procedure to find the optimal experimental design of a classic two-person imperfect information game [23]. Fourth, we conduct an expert prediction experiment to see which experimental designs domain experts would recommend, and we compare these predictions to the optimal experiment


suggested by our algorithmic approach. Fifth, we implement and run five experiments—each with different parameterizations—to illustrate the crucial role that information gain plays in distinguishing competing models of behavior and to showcase the adaptive nature of our approach. Finally, we expand the set of behavioral models (with Roth-Erev reinforcement learning) to illustrate how the optimal design procedure can be used iteratively to test behavioral hypotheses as they arise.

    2 Optimal Experimental Design: Theory

The notion of information gain is intuitive, but it is seldom defined precisely. Here, information refers to bits—a mathematical quantity that encodes counterfactual knowledge and describes the amount of uncertainty or noise in a system. In the context of experimentation, we gain information by becoming more certain about the relationships between what we predict and what we observe in the data that we collect. This is equivalent to saying that we are becoming more certain about the models and theories we use to describe the world. Different theories generate different models that predict how humans will behave, and an informative experiment is one that is able to rule out models that cannot adequately explain the data that they are confronted with. With an experimental design that maximizes information gain, a small batch of data may be sufficient to determine which of these competing models is best suited to describe the data observed, and—importantly—this is not necessarily true for experimental designs that do not maximize it (see Fig. 1 for a visual example of experiments with low and high information gain).

Using this formalism, we can compare different models’ likelihoods of experimental outcomes based on how much relative information each distribution will provide to us. This concept has clear applications to the optimal design of experiments for testing competing hypotheses about how humans behave. All that is required is to specify competing hypotheses in the form of generative models of human behavior, each assigning a likelihood to any given outcome of the game (the likelihoods must sum to 1). The best experiment we can run is then the experiment that will maximally distinguish the likelihood distributions of the different competing models. This is formally achieved by maximizing the Kullback-Leibler (KL) divergence [25] between two distributions, P and Q:

D_KL[P || Q] = Σ_{x ∈ X} p(x) log [ p(x) / q(x) ],   where D_KL[P || Q] ≠ D_KL[Q || P]     (1)
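As a quick numerical illustration of Equation 1 and its asymmetry, the following minimal Python sketch (ours, not part of the original implementation) computes D_KL for two hypothetical discrete distributions over the same outcomes:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[P || Q] = sum over x of p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical model likelihood distributions over the same four outcomes.
P = np.array([0.70, 0.10, 0.10, 0.10])
Q = np.array([0.25, 0.25, 0.25, 0.25])
print(kl_divergence(P, Q))  # D_KL[P || Q]
print(kl_divergence(Q, P))  # D_KL[Q || P]: a different value, since the measure is asymmetric
```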

Here, we refer to the optimal experiment as the particular set of design parameters, θ ∈ Θ, that are predicted to maximally distinguish between n competing generative models of behavior in a given task. As in El-Gamal & Palfrey (1996), we use a one-sided form of the KL divergence, which compares n − 1 competing models to a single model and is denoted I(1; θ):

I(1; θ) = Σ_{x ∈ X} l_1(x; θ) log [ (1 − p_1) l_1(x; θ) / Σ_{i=2}^n p_i l_i(x; θ) ]     (2)

where l_i(x; θ) is the likelihood that model i assigns to observing a specific dataset x in the space of all possible datasets X under the experimental design θ, and p_i is the experimenter’s prior over model i (both the sum of priors over models and the sum of model likelihoods over datasets equal 1). In other words, Equation 2 assigns a value for each combination of experimental parameters


in θ. This value corresponds to the amount of information we would expect to gain as a result of running an experiment with that particular combination of parameters.¹ By computing this value for every coordinate in a grid comprised of each parameter combination, we are able to create an “information surface” where the value of each point represents the information-theoretic difference between one model and the n − 1 competing models (for a visual example of this, see Fig. 3).
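To make Equation 2 concrete, the sketch below computes I(1; θ) over a toy grid of (A, π) designs, assuming one already has, for each design, an array of model likelihoods over a common enumeration of datasets together with a vector of model priors. The `information_gain` helper and the Dirichlet-random stand-in likelihoods are our own illustrative constructions, not the paper's code:

```python
import numpy as np

def information_gain(likelihoods, priors, ref=0):
    """I(ref; theta) from Eq. 2: compare model `ref` against the prior-weighted
    mixture of the remaining models, for a single experimental design theta.

    likelihoods: shape (n_models, n_datasets); each row sums to 1
    priors:      shape (n_models,); sums to 1
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    l_ref = likelihoods[ref]
    others = [i for i in range(len(priors)) if i != ref]
    mixture = np.einsum("i,ix->x", priors[others], likelihoods[others])
    mask = l_ref > 0
    return float(np.sum(l_ref[mask] *
                        np.log((1 - priors[ref]) * l_ref[mask] / mixture[mask])))

# Toy "information surface" over a coarse grid of designs (A, pi).
rng = np.random.default_rng(0)
A_grid, pi_grid = np.linspace(2, 6, 5), np.linspace(0.1, 0.9, 5)
priors = np.array([1 / 3, 1 / 3, 1 / 3])
surface = np.zeros((len(A_grid), len(pi_grid)))
for i, A in enumerate(A_grid):
    for j, pi in enumerate(pi_grid):
        # Random stand-in for the real model likelihoods l_i(x; theta) over all datasets.
        L = rng.dirichlet(np.ones(8), size=3)
        surface[i, j] = information_gain(L, priors, ref=0)
print(surface.round(3))
```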

As an illustration, consider the following example. The three authors of this paper each bet on a different model that they believe best describes human behavior in a given game. In order to determine who bet on the best model, we run an experiment. Whichever model has the highest relative likelihood of fitting the experimental data is considered the winning model. Obviously, to find a winner, it is crucial to avoid ties in model fitting. To do so—before running the experiment—we look at the distributions of likelihoods that each model assigns to every possible game dataset. Intuitively, we can expect many of the datasets to be associated with similar likelihoods from the three models. However, there is important information encoded in where the models’ predictions diverge. Our procedure locates the experimental design(s) that accentuate the differences among model predictions, represented by the experimental parameterization that maximizes the KL divergence between each model’s likelihoods of observing particular datasets. We would then run that experiment and use the data collected to determine which model most likely describes the behavior we observe. Crucially, had we not run the optimal design, our ability to distinguish between—and eventually rank—the competing models might have been obscured. On the other hand, an experiment that was optimized for information gain may allow us to rule out suboptimal models of behavior after collecting only a small batch of data. This ability (in addition to the fact that the experiment will likely be cheaper and require fewer participants) gives “information gain” an important role in understanding and optimizing experimental design.

In the following section, we detail the specifics of the experiment we are attempting to optimize. After this introduction to the experiment, the following two sections lay out the main contributions of the current paper. The first contribution is methodological; we describe two improvements to the optimal design procedure above. Then, we describe our method for comparing the outputs from our algorithm to experimental designs recommended to us by experts in behavioral economics and digital experimentation. The final section of this paper reports results from actually running several versions of this experiment in order to compare the predictions of experts to the output from our optimal design procedure.

3 Sample Application: Imperfect Information Game and Behavioral Models (Types)

Every study can benefit from an optimal experimental design. In the current work, we have chosen to optimize a two-player incomplete information game played over the course of three rounds, called the Stop-Go game [23]. This experiment is well-suited for an optimal design procedure because there are just two experimental parameters that define the gameplay, in addition to several model parameters. This is enough to create a vast number of possible outcomes, which makes an optimal experiment all the more necessary to run in order to learn from the data collected. Furthermore, we

1 It is important to note that this difference metric is a directed, asymmetric measure. Wang et al. use a similar metric but propose using the average KL divergence [10], which can be expressed as I(θ) = Σ_{i=1}^n p_i I(i; θ).


can directly compare our results against El-Gamal & Palfrey (1996), taking into account potential differences in lab vs. online behavior in the game.

The economic intuition for the Stop-Go game played over rounds is about learning and signaling of a given behavioral type within the population. In general terms, the assumption of homogeneity makes the game resemble the process of stereotype creation or, if played within a larger population over many rounds, norm formation. A more concrete economic application is the following: the game represents a number of situations in which the sender tries to manipulate the receiver with the “Go” signal but the receiver tries to out-think the sender’s manipulation. For example, a business (the sender) is choosing whether or not to advertise (Go or Stop) and a customer (the receiver) is trying to decide which product to buy. The business wants the customer to get the worse deal while the customer wants the better deal. The business knows whether or not the product is a good deal and can choose whether or not to advertise. The customer sees only an advertisement but doesn’t know whether the deal is actually good.

In the next subsections, we provide a brief explanation of the Stop-Go game and the behavioral models we would like to test on it. While an understanding of these details is important for this particular study, our optimal experimental design approach is fundamentally agnostic to both the specific experiment and the behavioral models it is applied to.

    3.1 Game Rules and Parameters to Optimize

The Stop-Go game starts with nature randomly selecting the state of the world: state a with probability π and state b with probability 1 − π (these two probabilities are common knowledge). Player 1 (color “Red”) is informed about the state of the world, while Player 2 (color “Blue”) is not. Player 1 then chooses Stop or Go. Stop ensures that both players receive a payout of 1, regardless of the state of the world.² If Go is selected, it is Player 2’s turn to select Left or Right. Finally, Player 1 and Player 2 are given payouts that depend on the state of the world and on one of the design parameters to optimize (A). Fig. 2 shows the decision tree for one round of a two-player Stop-Go game containing the two design parameters (A and π). Namely,

    • A is the maximum payout that either player can receive,

    • π is the probability that the state of the world is a.

In El-Gamal & Palfrey (1996), A was set to 3.33 and π to 0.5, generating the payoffs shown in Table 1.

Payoff Tables when A = 3.33 ($10.00)

  World a (probability π)            World b (probability 1 − π)
             Left      Right                    Left      Right
  Player 1   $0.00     $10.00        Player 1   $6.00     $0.00
  Player 2   $6.00     $0.00         Player 2   $0.00     $10.00

Table 1: Payoff table of the game. Assuming Player 1 chose Go, the payoffs for both players depend on the state of the world (a or b) and on the choice of Player 2 (Left or Right).

    2 Following the design by El Gamal & Palfrey (1996), Player 2 makes a decision even if Player 1 chooses Stop.
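To make the payoff structure concrete, here is a minimal Python sketch of the payoffs for a single round, assuming the structure of Fig. 2 and Table 1; `stop_go_payoffs` is a hypothetical helper, not part of the authors' implementation:

```python
import random

def stop_go_payoffs(state, p1_action, p2_action, A):
    """Payoffs (Player 1, Player 2) for one Stop-Go round, following Fig. 2 (A > 2)."""
    if p1_action == "stop":
        return (1, 1)                                   # Stop pays (1, 1) in both states
    if state == "a":
        return (0, 2) if p2_action == "left" else (A, 0)
    return (2, 0) if p2_action == "left" else (0, A)    # state b

A, pi = 3.33, 0.5
state = "a" if random.random() < pi else "b"            # nature draws the state of the world
print(state, stop_go_payoffs(state, "go", "left", A))
```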


[Figure 2 shows the game tree: Nature selects state a with probability π or state b with probability 1 − π. Player 1 then chooses Stop, which yields payoffs (1, 1), or Go. After Go, Player 2 chooses Left or Right, yielding payoffs (0, 2) or (A, 0) in state a and (2, 0) or (0, A) in state b.]

Figure 2: Extensive-form representation of the game. If Player 1 chooses Go, Player 2 must make a choice between Left and Right. Possible payoffs of the game are (0, 1, 2, A) with A > 2. The parameters to optimize are A and π.

The game is played in a group of 10 players for three rounds. Each player is assigned a role (1 or 2) which is kept fixed throughout the whole game. Each round, players are rematched so that they meet a new partner playing the other role (“perfect stranger”).

    3.2 Models of Behavior and Parameters to Optimize

The original paper by El-Gamal & Palfrey (1996) introduced the parameter π_per for the participants’ perception of the actual probability π and tested the following three models of behavior:³

1. Model 1: Each individual i plays the Bayes-Nash equilibrium of the game, defined by (π_per, A, ε).

2. Model 2: Player 2 does not update π_per following Go, and this is common knowledge.

3. Model 3: Individuals use fictitious play to construct beliefs about opponents’ play.

Each of the three models takes as inputs the two experimental parameters—A and π—described in the previous section and three model parameters: α (learning speed), ε (tremble rate), and δ (accuracy in the perception of probability π; it generates π_per = π ± δ). Table 2 provides a summary of both experimental and model parameters, their meanings, values, and ranges. The goal of our Bayesian optimal experimental design is now to find the point in the experimental parameter space (A × π) where the likelihoods of the three models differ the most.

    3 Wording as used in El-Gamal & Palfrey (1996). See Appendix for a more detailed description.


Parameter   Definition                           Type                   Range        Discrete Values
A           maximum payout to participants       experimental design    2.00–6.00    20
π           probability of state of the world a  experimental design    0.1–0.9      20
ε           tremble rate                         model                  0.00–1.00    34
α           learning rate                        model                  0.00–1.00    34
δ           misperception of π                   model                  0.00–0.20    7

Table 2: Parameters in the game. In order to define the optimal experiment, we need to assign likelihoods to the observations that are generated under each possible combination of these parameters. In our computations, continuous parameters are approximated using the number of discrete values shown in the last column.
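The discretization in Table 2 can be written down directly as parameter grids. The sketch below does so and adds a hypothetical helper for the perceived probability π_per = π ± δ described in Section 3.2 (clipping to a valid probability is our own assumption):

```python
import numpy as np

# Discretized parameter grids following Table 2.
A_values     = np.linspace(2.00, 6.00, 20)   # experimental design: maximum payout A
pi_values    = np.linspace(0.10, 0.90, 20)   # experimental design: probability of state a
eps_values   = np.linspace(0.00, 1.00, 34)   # model: tremble rate
alpha_values = np.linspace(0.00, 1.00, 34)   # model: learning rate
delta_values = np.linspace(0.00, 0.20, 7)    # model: misperception of pi

def perceived_pi(pi, delta):
    """pi_per = pi +/- delta, clipped to remain a valid probability (our assumption)."""
    return np.clip([pi - delta, pi + delta], 0.0, 1.0)

print(perceived_pi(0.5, delta_values[-1]))   # e.g. [0.3, 0.7] for delta = 0.2
```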

    4 Optimal Experimental Design: Implementation

In order to determine the optimal experimental design that maximizes information gain, we need to compute how likely it is to observe different outcomes. The first step in this calculation is to enumerate every possible dataset we might observe when we run our experiment. We define the behavioral responses observed at the smallest unit of analysis in the game as a single outcome. In our case, this is a pair of players playing one round of the game. Following Fig. 2, this gives us eight possible outcomes: a-go-left, a-go-right, a-stop-left, a-stop-right, b-go-left, b-go-right, b-stop-left, and b-stop-right.⁴ The ensemble of all outcomes for all players for all rounds is a dataset (i.e., the data obtained after running the experiment once). The total number of possible datasets is equal to the number of permutations of outcomes with replacement. Even in a basic imperfect information game like ours, the number of possible datasets quickly becomes very large. For example, with only one round and five pairs of players, that number is 32,768. For three rounds, the total number of unique datasets is equal to 32,768^3 ≈ 3.5 × 10^13. This is a great deal of likelihoods to compute! However, in this count, some datasets are “duplicates” in the sense that they contain the same distribution of outcomes but the players generating those outcomes switched positions. Depending on how the models actually assign likelihoods to outcomes (e.g., if and how they take into account the history of past outcomes, as learning models would) and on the rules for matching players in the game, the total number of likelihoods to compute can be greatly reduced. However, in our specific case, this is not possible, as games with strategic interactions, information asymmetries, or network effects require us to keep track of the sequence of outcomes experienced by each participant.
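The counting argument above can be reproduced in a few lines; the following sketch enumerates the eight outcome labels and computes the number of possible datasets for five pairs and three rounds, matching the figures quoted in the text:

```python
# Eight possible outcomes for one pair in one round (state, Player 1, Player 2).
outcomes = [f"{state}-{p1}-{p2}"
            for state in ("a", "b")
            for p1 in ("go", "stop")
            for p2 in ("left", "right")]
assert len(outcomes) == 8

pairs, rounds = 5, 3
per_round = len(outcomes) ** pairs   # 8^5 = 32,768 possible datasets for a single round
total = per_round ** rounds          # 32,768^3, roughly 3.5e13, for three rounds
print(per_round, f"{total:.3e}")
```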

Recall that this optimal design procedure is meant to distinguish competing models of behavior, and in order to determine the optimal experiment, we must first use each competing model to assign a likelihood to each unique dataset, for each combination of experimental design and model parameters. In the current study, we use 3 model parameters (ε, α, δ). Each of these parameters needs to be integrated out when calculating the likelihood of an individual dataset. We discretized them to test 34 different values of ε, 34 different values of α, and 7 different values of δ (see Table 2). Each value of δ leads on average to 7 values of the model parameter π_per, hence 34 × 34 × 7 × 7 = 56,644. This generates 56,644 unique combinations of model parameters for each experimental parameter. There are 2 experimental design parameters (A, π). In the default, brute-force approach, we evaluated them in a grid of 20 × 20 = 400 experimental parameter combinations. The output of an optimal

4 We closely follow the implementation of the Stop-Go game by El-Gamal & Palfrey (1996), where Player 2 makes a decision even if Player 1 chooses Stop.


(A) Original method, El-Gamal & Palfrey (1996)
(B) Replication of method from El-Gamal & Palfrey (1996)
(C) Improved search method using Parameter-Sampled GPUCB-PE

Figure 3: Information surfaces for experimental design. (a) Information surface originally presented in El-Gamal & Palfrey (1996); (b) information surface generated by replicating the optimal design procedure in El-Gamal & Palfrey (1996); (c) information surface generated using Parameter-Sampled GPUCB-PE, where the points shown represent coordinates searched by the algorithm and the green star marks the point corresponding to the experiment that is predicted to produce the maximum information gain. Note: as described in El-Gamal & Palfrey (1996), the experiment with the maximum information is expected to occur when A reaches its minimum value, which approaches (but is not equal to) A = 2.0.

design procedure such as this one is an information surface that represents the multi-dimensional (in our case, two-dimensional) grid of every possible experiment and its associated information gain from taking the KL divergence between the dataset likelihoods of the competing models (Fig. 3). In our case, the total number of computations necessary to determine the optimal design of our experiment is equal to 400 × 56,644 × 3.5 × 10^13 ≈ 8 × 10^20.

    4.1 Algorithmic Improvements for Optimal Design

Running the total number of likelihood calculations proved computationally infeasible on a supercomputing cluster, even for the coarse grid of experimental design parameters we chose.⁵ To address this, we introduce two improvements to the Bayesian optimal design procedure.⁶

1. Which experimental designs are evaluated? We use Gaussian Process search to adaptively select the next coordinate in the information surface to evaluate (Section 4.1.1).

2. How are points evaluated? We sample datasets (by uniformly sampling model parameters and simulating datasets) instead of spanning through all possible datasets (Section 4.1.2).

The first improvement gets around the requirement to grid-search every combination of experimental design parameters in order to construct the information surface. The second improvement gets around the costly step of calculating the likelihoods of all possible datasets under a given model,

5 In fact, we could construct the information surface for choosing the optimal design parameters only for a simpler two-player version of our game, and it took approximately 72 hours on the following supercomputing cluster: four hundred parallel R v.3.x jobs, distributed across 56-core x86_64 Little Endian Intel(R) Xeon(R) CPUs (E5-2680 v4 @ 2.40GHz; L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 35840K).

    6 All code is available at http://github.com/shakty/optimal-design.


which was previously needed for computing the information associated with each combination of experimental design parameters.

    4.1.1 Gaussian Process Upper Confidence Bound-Pure Exploration

Given the computationally taxing nature of this systematic grid-search technique for finding the optimal experimental design, we implemented an adaptive search technique used in artificial intelligence research to speed up the construction of the information surface. This approach leverages recent algorithmic developments for adaptive search from machine learning, known as Gaussian Process (GP) regression [26, 27]. The basic premise behind GP regression is that there are often local correlations among data points, and having prior knowledge about a given data point can inform one’s posterior belief about “nearby” data. When dealing with multidimensional data, this simple assumption proves to be quite useful for generating insights about unobserved data, allowing for the automated generation of predictions about the entire search space without needing a systematic grid search.

The search algorithm we use is known as Gaussian Process Upper Confidence Bound Pure Exploration (GPUCB-PE), and it is a way of reconstructing the contours of a landscape in an adaptive manner, while also minimizing regret. Here, regret is defined as the difference between the maximum observed value up to a given timestep and the final maximum observed value [28, 24]. GPUCB-PE was introduced as a variant of GPUCB [24] and was designed to encourage more exploratory searches (hence the inclusion of “Pure Exploration”)—behavior that is particularly effective when the landscape being searched is especially large or otherwise complex.

Recall that our search task is on an information landscape where each point is a possible combination of experimental design parameters, each of which is associated with an information value (how much information we would gain from running an experiment with those design parameters). Intuitively, one can imagine a subset of experimental designs that are simply uninformative (i.e., they are likely to generate datasets that will be unable to distinguish between our competing models), and it would be computationally wasteful to exhaustively search the experimental design landscape in this region. Leveraging this allows the algorithm to focus its search on parameter combinations that are likely to be good, while reducing the evaluation of parameter combinations that are likely to be bad. However, this poses an explore-exploit dilemma: how can the algorithm ensure that there are not hidden pockets of high-value parameter combinations in otherwise low-value regions? Herein lies the power of the GPUCB-PE algorithm in particular: the pure exploration (PE). This variant of Gaussian Process search explicitly balances the exploitation of high-value regions of the landscape with periodic searches in regions of high uncertainty (see SI A.1 for more detail), thereby minimizing the chances of overlooking a high-value point hidden in a low-value region of the landscape.

In summary, this first algorithmic improvement allows us to quickly uncover the experimental parameter combination with the highest information value while generating inferences about the information value of the points it has not yet searched and avoiding unnecessary searches in low-information-value regions. It does this by the following steps (a code sketch follows the list):

    1. Initializing a multi-dimensional surface of coordinates that represent possible combinations ofexperimental parameters (in this case, combinations of A and π values).

2. Using Equation 2, calculating the KL divergence between the model likelihoods of the n competing models for a sample of initial points. These initial points can be IID random points

    10

in the landscape, but we chose to distribute the initial points using a quasi-random spacing technique known as Sobol sequencing [29], which achieves better initial coverage.

3. Proceeding with the GPUCB-PE algorithm, iteratively selecting points in the landscape and calculating the information gain at each of those points, while also using Gaussian Process regression to infer the expected information gain and confidence bounds of every point that has not yet been searched. This algorithm alternates between selecting (i) the point with the highest expected information gain plus its upper confidence bound (the “UCB”) and (ii) the point—within a region of eligible points—with the largest confidence bounds (i.e., the largest uncertainty; detailed in Appendix A.1 and [28]). The two types of points that this algorithm selects address the exploitation and exploration of the landscape, respectively.

4. After each new point has been searched, determining whether the landscape satisfies the stopping rule, which is based on the similarity between the current reconstruction of the landscape and the previous reconstruction. If the two landscapes are repeatedly more than 99.9% similar to each other according to a modified Spearman rank correlation, the algorithm stops and the final information surface is output.
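The following simplified Python sketch illustrates the loop described above using scikit-learn's GaussianProcessRegressor. The authors' actual implementation (see the linked repository) differs in details such as the Sobol initialization, the eligible-region rule for the pure-exploration pick, and the modified Spearman stopping criterion; the `info_value` callable stands in for the Eq. 2 computation and the toy objective at the bottom is purely illustrative:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gpucb_pe(info_value, candidates, n_init=20, n_iter=100, beta=2.0, tol=0.999):
    """Simplified GPUCB-PE: alternate a UCB pick (exploitation) with a
    maximum-variance pick (pure exploration) over a finite candidate grid."""
    rng = np.random.default_rng(0)
    idx = list(rng.choice(len(candidates), size=n_init, replace=False))  # random init (paper uses Sobol)
    y = [info_value(candidates[i]) for i in idx]
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    prev_mu = None
    for t in range(n_iter):
        gp.fit(candidates[idx], y)
        mu, sigma = gp.predict(candidates, return_std=True)
        # Stopping rule: successive surface reconstructions are nearly rank-identical.
        if prev_mu is not None:
            rho, _ = spearmanr(mu, prev_mu)
            if rho > tol:
                break
        prev_mu = mu
        # Even iterations exploit (UCB); odd iterations purely explore (largest uncertainty).
        score = mu + beta * sigma if t % 2 == 0 else sigma
        nxt = int(np.argmax(score))
        if nxt not in idx:
            idx.append(nxt)
            y.append(info_value(candidates[nxt]))
    best = idx[int(np.argmax(y))]
    return candidates[best], max(y)

# Candidate designs: a 20 x 20 grid over (A, pi), as in Table 2.
A_grid, pi_grid = np.meshgrid(np.linspace(2, 6, 20), np.linspace(0.1, 0.9, 20))
designs = np.column_stack([A_grid.ravel(), pi_grid.ravel()])
toy_info = lambda theta: -(theta[0] - 2.2) ** 2 - (theta[1] - 0.5) ** 2  # stand-in for Eq. 2
print(gpucb_pe(toy_info, designs))
```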

This method enables us to reconstruct an accurate information surface (1) without needing to span the grid of all possible experiments, (2) with more precision than a grid-searched landscape, as maxima may be found inside the tiles described by the grid (see Section A.1 in the Appendix), and (3) all while minimizing the regret of the search process. For the interested reader, we recommend foundational work on the regret of GPs, showing that GPs will converge to the optimal solution at an approximate rate of O(1/√t) [30], as well as more recent work that accommodates the dimensionality d of the search space, showing that GPs converge to the optimal solution at a rate of O(e^(−τt/(ln t)^(d/4))) [31]. While simply being regret-minimizing does not negate the challenge that a complex search space might play a role in ultimately converging on the optimal experimental design, it illustrates the flexibility and power that GPs have in addressing this problem.

In the next section, we introduce the second algorithmic improvement used in this work and define our novel search algorithm: the Parameter-Sampled GPUCB-PE.

    4.1.2 Sampling Parameters to Sample Datasets

While GPUCB-PE addresses which points in the information landscape need to be evaluated, our second improvement addresses how the information value of each point is evaluated; namely, we introduce a model parameter sampling procedure that allows us to avoid the costly calculation of model likelihoods for every possible dataset. Recall that there are two main classes of parameters in this work: experimental design parameters and model parameters (Table 2). We use GPUCB-PE to find the most informative combination of experimental design parameters. In order to arrive at an information value for each point in the experimental design landscape, we compute a KL divergence between the likelihoods that each model assigns to every possible dataset, for a specific experimental design. Assigning a likelihood to every possible dataset is immensely costly, and as the number of participants and/or the number of rounds in an experiment increases, this computational complexity becomes prohibitively large.

We introduce a simple and effective solution to this problem, which we will refer to as the Parameter-Sampled GPUCB-PE. The basic algorithmic procedure is identical to the GPUCB-PE

    11

approach described in Section 4.1.1 (i.e., iteratively searching combinations of experimental parameters, honing in on those with high information values until convergence). However, when it comes to the calculation of the information value—taking the KL divergence between each model’s likelihoods of the approximately 3.5 × 10^13 datasets—we show that we can achieve the same performance by comparing the likelihoods of only a fraction of all possible datasets in the KL divergence calculation.

Parameter-Sampled GPUCB-PE lets us avoid spanning all N possibly-observed datasets by uniformly sampling n_s ≪ N combinations of model parameters and using them to generate n_s synthetic datasets. In essence, each of the n_s samples of model parameter combinations simulates the datasets that would be generated if participants were making decisions under a given model with the selected model parameters. In full, this model parameter sampling procedure entails the following steps (a code sketch follows the list):

1. For a given model, draw a parameter value uniformly at random for each model parameter.⁷ As an example, consider Model 1, which includes three parameters: ε, α, δ. This step simply draws a random value for each parameter (ε_i, α_i, δ_i) from each of their respective ranges (see Table 2 for the ranges of values that these parameters can take).

2. For each model being compared, sample a dataset that might be observed if the simulated participants behaved according to the model parameters ε_i, α_i, δ_i, for all rounds. This is done by inserting the sampled parameters into each of the model likelihood functions (Section A.4 in the Appendix) and sampling a dataset from the resulting distribution of datasets.

3. Repeat Steps 1 and 2 a sufficiently large number of times, n_s. Each model being compared will ultimately have its own sample of n_s datasets.

4. For each unique dataset, compute the fraction of times it shows up in each of the models’ samples, thereby creating a histogram of dataset frequencies. These histograms are meant to approximate the true distribution of likelihoods of datasets and, as such, we use them to calculate the KL divergence.
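A minimal sketch of Steps 1–4 follows. It assumes each behavioral model is exposed as a function that, given sampled model parameters and a design θ, simulates one hashable dataset; `sampled_likelihoods`, the toy models, and the smoothing constant (added to avoid zero-probability bins in the KL computation) are our own illustrative stand-ins for the likelihood functions of Appendix A.4:

```python
import numpy as np
from collections import Counter

def sampled_likelihoods(models, theta, param_ranges, n_s=10_000, seed=0):
    """Approximate each model's distribution over datasets by sampling model
    parameters uniformly at random and simulating one dataset per draw.

    models:       callables model(params, theta, rng) -> hashable simulated dataset
    param_ranges: dict of (low, high) bounds per model parameter (cf. Table 2)
    Returns an array of shape (n_models, n_unique_datasets) of relative frequencies.
    """
    rng = np.random.default_rng(seed)
    counts = [Counter() for _ in models]
    for _ in range(n_s):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in param_ranges.items()}
        for c, model in zip(counts, models):
            c[model(params, theta, rng)] += 1          # Steps 1-3: sample and tally
    support = sorted(set().union(*counts))             # union of datasets seen under any model
    smooth = 1e-9                                      # tiny mass so the KL ratio is defined
    return np.array([[(c[d] + smooth) / (n_s + smooth * len(support)) for d in support]
                     for c in counts])                 # Step 4: histogram per model

# Toy usage: each stand-in "model" turns theta and a tremble rate into 3 rounds of Go/Stop draws.
def toy_model(bias):
    def model(params, theta, rng):
        p_go = bias * theta[1] * (1 - params["eps"])
        return tuple(bool(rng.random() < p_go) for _ in range(3))
    return model

L = sampled_likelihoods([toy_model(0.9), toy_model(0.5), toy_model(0.1)],
                        theta=(3.33, 0.5),
                        param_ranges={"eps": (0, 1), "alpha": (0, 1), "delta": (0, 0.2)})
print(L.shape)  # (3 models, number of distinct sampled datasets); rows feed the KL computation
```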

We sample n_s = 10,000 datasets per search—several orders of magnitude fewer than what was required before. Not only does this approach reduce the number of computations by several orders of magnitude, we also found that sampling around 1,000 datasets was sufficient to (1) qualitatively reconstruct the information landscape, (2) make the algorithm satisfy the stopping rule, and (3) most importantly, find the coordinate of the optimal design on the information surface (see Appendix A.1.1 for further analysis).

It should be noted that the Parameter-Sampled GPUCB-PE has no prior knowledge of either the underlying models or their behavioral assumptions. Even so, it is able to reliably reconstruct information surfaces, proving itself to be a very versatile approach for creating informative experiments in any knowledge domain. In addition, while we use this Parameter-Sampled GPUCB-PE approach in order to find the maximally informative experimental design, the algorithm can be used to create an analogous “information surface” corresponding to any measure that we may want to optimize (e.g., experiment cost, treatment effect).

Together, the two improvements address two key bottlenecks in the Bayesian optimal experimental design procedure: which points get evaluated when searching for the most informative experiment and how the information value of each point gets assigned.

7 Non-uniform sampling can be used when the experimenter has a prior over the distribution of model parameters.


[Figure 4 plots regret (y-axis, 0.0–1.4) against iteration (x-axis, 0–300) for GPUCB-PE, Random search, and Grid search (with 20 initial seeds); an inset compares grid searches of increasing size (4, 9, 16, …, 400 searches).]

Figure 4: Comparison of algorithm performance measured by regret. Here we compare the performance of three different search methods for finding the optimal experiment: Random search, Grid search, and GPUCB-PE. We plot the mean regret over time for each method (averaged over 200 runs), where regret is defined as the difference between the maximum value uncovered by the algorithm and the average global maximum. We show the 95% confidence interval with translucent bands around the curves (error bars for Grid search). For each search method, we mimic the steps of our Parameter-Sampled GPUCB-PE algorithm. That is, we sample datasets instead of exhaustively searching the parameter space: for every experimental parameter that the algorithm selects (i.e., for every coordinate searched on the information surface), 10,000 model parameters are sampled, and a synthetic dataset is generated using these sampled experimental and model parameters, for each of the three models being compared. These sampled datasets are then used to assign an information value to each coordinate in the information surface. The Parameter-Sampled GPUCB-PE algorithm finds a solution close to the global maximum almost immediately, whereas Random and Grid search require more searches in order to find a value close to the global maximum.

4.2 Performance Evaluation

We compared the effectiveness of our Parameter-Sampled GPUCB-PE algorithm to a grid search of the landscape and a random search (Fig. 4). The algorithmic improvements we have made to the optimal design procedure laid out in El-Gamal & Palfrey (1996) construct an information surface that maintains the same general contours and precisely the same global maximum value as the original algorithm (compare Fig. 3C and 3B). This allows us to rapidly understand the role that particular combinations of experimental design parameters play in determining how informative an experiment is. Furthermore, it does so considerably faster than the traditional grid-search approach, evaluated here by iteratively constructing grids of increasing size (see inset in Fig. 4). A key limitation of the grid-search approach, in addition to being slow, is that the coordinate


of the global maximum value could–and indeed does–fall between the cells of the grid. Thus, even with a very fine grid, it never finds a solution as good as that of GPUCB-PE. The GPUCB-PE algorithm performs significantly better than Random and Grid search, achieving the lowest regret and achieving it quickly. Statistical analysis confirms that the regret of GPUCB-PE is significantly lower than that of Random search from iteration 30 until Random search incidentally discovers the true global maximum at approximately iteration 275 (t-test, comparison of means, p < 10^-10 at iteration 30).
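For concreteness, the regret curve used in this comparison can be computed as in the following sketch (under the definition given in the caption of Fig. 4; the example values are invented):

```python
import numpy as np

def regret_curve(values_found, global_max):
    """Regret at iteration t: gap between the best value found so far and the
    global maximum of the information surface (cf. the definition in Fig. 4)."""
    best_so_far = np.maximum.accumulate(np.asarray(values_found, dtype=float))
    return global_max - best_so_far

# Invented example: a search that improves over six iterations on a surface whose maximum is 1.4.
print(regret_curve([0.2, 0.9, 0.9, 1.1, 1.3, 1.4], global_max=1.4))
```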

    5 Wisdom of Experts

There are a number of ways to assess the effectiveness of the optimal experimental design protocol we have laid out. As we have previously described, our algorithm finds the Bayesian optimal experiment that maximally distinguishes competing models of behavior. However, a persistent question that arises in this approach is whether or not the output from this optimal design procedure would even be useful to experimentalists in academia or industry. Is the algorithm simply generating outputs that would be obvious to a trained experimenter? We sought out experts in behavioral economics and digital experimentation to answer this question, and we asked them to complete a survey where they provided us an estimate of what experiment they would run if they were in control of the design.

We first described the task: these experts were to suggest parameters for an experimental design that would maximally distinguish the likelihoods of four different models of behavior. We then collected demographic information about the respondents (their gender, field of expertise, academic position, and country of employment). Next, we described the imperfect information game, using the same language and figures as in Section 3 and Fig. 2 (for precise wording, see the Appendix). After describing the four models the experiment was meant to distinguish, we asked the participants to suggest the precise values for A and π that would isolate and distinguish the different models from one another. These values were entered in text boxes constrained to the admissible ranges for A and π. The participants then reported their confidence in each of those predictions (from 1, not confident, to 5, very confident), as well as indicating which model they believed would eventually perform best once the experiment was ultimately run.

    5.1 Recruitment Strategy

In order to assemble a group of experts reasonably suited to perform this task, we sought out researchers who were familiar with experimental design as well as with common behavioral models such as the Nash equilibrium. We preferred researchers who were currently faculty in a department related to behavioral economics. These criteria led us to email researchers from the following areas:

• Authors of work presented at the CODE@MIT, a conference on digital experimentation (2014, 2015, 2016, and 2017).

• Attendees of the “Behavioral/Micro” course at the National Bureau of Economic Research (NBER) Summer Institute (2012, 2013, 2014, 2015, 2016, and 2017).

• Authors of work presented at the Economic Science Association’s North American annual conference (2015, 2016, and 2017).


• Boston-area faculty in economics (including Harvard University, Massachusetts Institute of Technology, Northeastern University, Boston University, Boston College, University of Massachusetts Amherst, and University of Massachusetts Boston).

This process generated 811 names, to whom we sent an initial email on 12/01/2017 (23 emails bounced back, resulting in a total of 788 successfully delivered emails). This initial email served to introduce the project, provide information about Northeastern University’s IRB protocol, provide a link to the survey itself, and allow recipients to opt out of further communication. After one week, a reminder email was sent to those who had not completed the survey. After another week, a thank-you email was sent to those participants who had completed the survey. This protocol was adapted from methods used in [32]. In total, 147 participants began the survey (i.e., opened the browser window and agreed to the informed consent), and 55 participants actually finished the survey. There is no difference between respondents who started the survey and those who dropped out after beginning it. We find that more senior invitees are less likely to accept the survey invitation. However, the final sample is mainly composed of experienced researchers—those who are probably the most interested in the subject—and contains only 11 PhD students. Full details are in Sec. A.2.5 of the Appendix.

    5.2 Expert Prediction Results

We report the major descriptive results here. For full survey results, see Table 3 in the Appendix. The median response time was approximately 11 minutes. The mean response time was 105 minutes, although after removing participants who took over 6 hours to complete the survey (we assume that they were not actively working on the survey for that amount of time), this value became closer to 16 minutes. Of the 18% of invitees who began the survey, 39% finished it, for an overall response rate of 7%.

This survey generated three main takeaways. First, and most importantly, there appeared to be little agreement among the various respondents regarding what the optimal experiment to run was, as shown in Fig. 5A. This is corroborated by the low median self-reported confidence in their predictions about both A and π (a median of “2 - somewhat not confident” for both A and π).

Second, while the estimates spanned much of the experimental design space, the modal value was π = 0.5 and A = 6.0. This corresponds to a point on our information surface that would be expected to generate less information gain than the optimal design. The effect of running this suboptimal experiment might be that the competing models would appear to be less distinct, at which point we would need more data in order to distinguish the most likely model.

Third, this modal value suggests an interesting mechanism for how experts choose experimental designs when they are uncertain. While the most commonly chosen value for π (0.5) was the same as the optimal value, the experts surveyed were inclined to recommend an experiment that would pay higher incentives as the optimal one (A = 6.0), making the compensation of research subjects more expensive (if at least one game results in Go-Right). That is, experts appeared to believe that giving research participants a higher reward in an experiment would help clarify greater differences among the four competing models.⁸ Indeed, it is generally accepted in experimental economics that greater rewards can increase participants’ effort and attention during an experiment [33, 34], even if

8 Note: the optimal experiment we report in Figure 3—A ≈ 2.0 and π ≈ 0.5—is generated from an information surface comparing three models. The same coordinate is the optimal experiment when we include all four models. See Fig. A.3 in the Appendix for the full information surface.


(A) Experts’ optimal experiment predictions (n = 55). (B) Final description of experiments run.

Figure 5: Expert predictions and experiments chosen. (a) Raw data from the expert survey showing the modal prediction of A = 6.0 and π = 0.5, but more importantly, the inherent noise and uncertainty in the respondents’ answers (data shown with small jitter to avoid overplotting); (b) the five points in the experimental design space that were ultimately tested on Amazon Mechanical Turk. Note: the experiment with the highest information gain is when π = 0.5 and A approaches 2.0. We report results from the design (π = 0.5, A = 2.0) because human participants would be unable to distinguish A = 2.000 from A = 2.001, for example.

some contrary evidence is also found [35]. Additionally, experts may have made further assumptions about the role of different parameters of the behavioral models (e.g., participants’ actual tremble rate). Nevertheless, according to the landscape generated by our theory of optimal design, lower values of the payoff parameter A tend to increase the informativeness of our experiment, not higher values.

We chose to include the modal value of π = 0.5 and A = 6.0 from these results as one of the experimental designs to test when we eventually evaluated this optimal design procedure.

    6 Behavioral Experiments

We collected five datasets for five different experimental designs: (1) the design suggested by the modal expert response, (2) the optimal design determined by our algorithm, (3) a medium-information-gain design, (4) the lowest-information design, and (5) the original experiment run by El-Gamal & Palfrey (1996) (see Fig. 5B).

We recruited participants from the online labor market Amazon Mechanical Turk (AMT) [36, 37, 38] and ran our experiments on the nodeGame platform for group experimentation online [39], according to the experimental procedure outlined below.


Figure 6: Likelihood odds ratios of the models according to real participants’ behavior. The unit of observation is a complete player’s 3-round game history H. For instance, if H = 4, we sampled the game histories of 4 random Blue players and computed the likelihood of this reduced dataset according to the models. We used a bootstrapping procedure to simulate the likelihoods generated by datasets with fewer observations than the total actually collected in order to assess the stability of these likelihoods. The final likelihood odds for each of the models is the value of the rightmost data point in each panel.

    6.1 Experimental Protocol

The experimental protocol closely resembled the description found in El-Gamal & Palfrey (1996), with some adaptations for the online environment. All experiments were incentive-compatible and used no deception; the protocol was approved by the Northeastern University IRB (#13-13-09).

As soon as an online worker accepted the task on AMT, he or she would receive the link to our experimental platform. Upon starting the experiment, each participant would first go through an instructions page and a brief tutorial alone, before being matched with other players in the interactive part. Our tutorial was modeled on the tutorial in the original El-Gamal & Palfrey (1996) paper, in which participants could choose to play a simulated game as either the Red or the Blue player.

After completing the tutorial, the participant would be moved into a waiting room. As soon as enough participants were present, a new game room would be launched. The waiting times in the waiting rooms were short (two to four minutes).

The interactive experiment would then proceed according to the rules explained in Sec. 3.1. Participants always had their cumulative score, the round count, and the time left for a decision visible at the top of the page; the history of their own decisions from previous rounds was available upon clicking the History button. More details, including the full instructions, how we handled dropouts, and screenshots of the interface, are available in Sec. A.3 of the Appendix.


6.2 Experiment Results

In total, we collected behavioral data from 79 human players. The average duration of the interactive part of the experiment was 5 minutes and 20 seconds; the average payoff per treatment in US dollars was: 1.15 (A = 2 and π = 0.5), 1.61 (A = 3.33 and π = 0.5), 1.97 (A = 6 and π = 0.2), 2.20 (A = 6 and π = 0.5), and 2.12 (A = 6 and π = 0.8).

Our goal was to collect as many experimental datasets as necessary to distinguish among the different models. Recall from Sec. 4 that in our experiment a dataset is composed of the joint game histories of a group of 10 players. After collecting just one dataset per experimental design, we were able to determine which model best described human behavior in our game.

The predicted likelihoods of observing the collected data under the three models are plotted as likelihood odds ratios in Fig. 6. Heuristically, one can think of the plots where the likelihood odds differ more as containing more information. As such, it is encouraging to note that experimental designs (3.33, 0.5) and (2.0, 0.5) are the ones where the likelihoods of the different models diverge the most. These designs correspond to regions of high information in our information surface. We map the same data onto the information surface in Fig. 7.
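As a rough illustration of the bootstrap behind Fig. 6, the sketch below subsamples H player histories and accumulates log-likelihoods under each candidate model. It is a minimal sketch, not the paper's implementation: the histories list, the model_logliks functions, and the choice of reporting odds relative to the first model are hypothetical placeholders for the actual data structures and model code.

```python
import numpy as np

def bootstrap_likelihood_odds(histories, model_logliks, H, n_boot=500, rng=None):
    """Estimate likelihood odds from reduced datasets of H player histories.

    histories:     list of complete 3-round game histories (one per Blue player)
    model_logliks: dict mapping model name -> function(history) -> log-likelihood
    Returns the mean log odds of each model, relative to the first model listed,
    averaged over n_boot subsamples of size H.
    """
    rng = rng or np.random.default_rng()
    totals = {name: [] for name in model_logliks}
    for _ in range(n_boot):
        idx = rng.choice(len(histories), size=H, replace=False)
        for name, loglik in model_logliks.items():
            totals[name].append(sum(loglik(histories[i]) for i in idx))
    means = {name: np.mean(vals) for name, vals in totals.items()}
    baseline = next(iter(means))
    return {name: means[name] - means[baseline] for name in means}
```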

    7 Iterating the Procedure

Across all experiments, we find strong evidence that Model 1 (the Bayes-Nash equilibrium prediction) is best suited to describe the observed data (i.e., Model 1 has the highest posterior odds). However, the process for selecting “the best” model never ends as we can never definitively accept one model as true [23]. The fact that one of the models performs well is not an indication that it will continue to do so when compared to other models based on new theories. When new theories of behavior are proposed, it is then desirable to test these new theories against the best known alternative. A key advantage of the optimal experimental design approach is that it can be iterated, allowing for sequential testing of new models. That is, the approach allows the comparison of predictions from new models against the predictions of the most likely model currently available.

We illustrate this important advantage of the optimal experimental design approach, iteratively expanding the set of behavioral models, by testing a new theory that was not included in the original work on the imperfect information game: reinforcement learning. That is, we repeated our optimal design protocol and included a fourth model that casts human behavior as Roth-Erev reinforcement learning [40].
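To fix ideas, here is a minimal sketch of the basic Roth-Erev propensity update for a two-action (Stop/Go) player. The initial propensities and the payoff sequence are illustrative assumptions, not the calibration used in our analysis.

```python
import numpy as np

def choice_probs(propensities):
    """Roth-Erev choice rule: probabilities proportional to accumulated propensities."""
    return propensities / propensities.sum()

def update(propensities, action, payoff):
    """Basic Roth-Erev reinforcement: add the realized payoff to the chosen action."""
    new = propensities.copy()
    new[action] += payoff
    return new

rng = np.random.default_rng(0)
propensities = np.array([1.0, 1.0])   # hypothetical initial propensities (Stop, Go)
for payoff in [2.0, 0.0, 6.0]:        # hypothetical payoffs; in the game these depend on both players' actions
    action = rng.choice(2, p=choice_probs(propensities))
    propensities = update(propensities, action, payoff)
print(choice_probs(propensities))
```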

Learning theories have become increasingly popular for describing human decision-making across a number of settings [3, 41]. Recent reviews have shown that reinforcement learning is well-suited to describe human behavior, especially in settings where learning from experience is necessary in order to maximize a payout [42]. Given these promising results in other areas, Roth-Erev reinforcement learning is a good candidate to study. Other models to compare could include those based on Experience-Weighted Attraction learning [43, 44].

Once this new model is specified (i.e., it is able to assign a likelihood of observing a particular dataset given initial conditions), we are able to seamlessly incorporate it into our optimal experimental design procedure. The only change to this procedure comes when we calculate the information value associated with a given set of experimental design parameters; recall, this is done by calculating the KL divergence (Eq. 2) between the likelihoods (of observing a given dataset) generated by each of the different models. Whereas the values in the information surface from Figure 3C are based on a KL divergence between three models, the values of this new information surface would

Figure 7: Information surface with corresponding experimental results. Intuitively, what we see here validates our expectation of the amount of information gain, that is, the amount of divergence between the likelihoods of the different models.

be based on a comparison of four models. In SI Figure A.3, we show this resulting surface, noting that the addition of the fourth model does not dramatically change the estimates about the optimal experiments.
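As a rough illustration of this step (not the exact form of Eq. 2, which is defined in the main text), the sketch below estimates how well a candidate design separates a set of models by simulating datasets from sampled model parameters and averaging log-likelihood contrasts, a Monte Carlo stand-in for the KL divergence between model likelihoods. The sample_params, simulate_dataset, and loglik methods are hypothetical interfaces used only for illustration.

```python
def information_value(design, models, n_samples=1000):
    """Monte Carlo estimate of how well a design discriminates among models.

    models: dict mapping model name -> object exposing
        .sample_params()                   draw model parameters from the prior
        .simulate_dataset(params, design)  simulate one full dataset
        .loglik(dataset, design)           log-likelihood of a dataset under the model
    (All hypothetical interfaces.) Larger return values mean that datasets generated
    by one model tend to look implausible under the competing models.
    """
    total, count = 0.0, 0
    for name_i, model_i in models.items():
        for _ in range(n_samples):
            params = model_i.sample_params()
            data = model_i.simulate_dataset(params, design)
            ll_i = model_i.loglik(data, design)
            for name_j, model_j in models.items():
                if name_j != name_i:
                    total += ll_i - model_j.loglik(data, design)
                    count += 1
    return total / count
```

Under an interface like this, incorporating a fourth model amounts to adding one more entry to the models dictionary.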

When we include this fourth model in the analysis of empirical data, Model 1 is no longer the best model for describing the data (see Fig. 8). Reinforcement learning explains the data significantly better than any of the other models, across all five experimental designs. Through this process,

Figure 8: Likelihood odds ratios of the models given real participants’ behavior, including a fourth model. The unit of observation is a complete player’s 3-round game history H. For instance, if H = 4, we sampled the game histories of 4 random Blue players and computed the likelihood of this reduced dataset under each model. After including Model 4, a reinforcement learning model, Model 1 is no longer the best-suited for describing this data.

we found that the new fourth model outperformed the other three models in every experiment that we ran. The reinforcement learning model reliably and robustly describes the data from five different experiment configurations better than the static equilibrium predictions. This also suggests that the reinforcement learning model is quite insensitive to the parameterization of the experiment, indicating its suitability for describing behavior.

In summary, we make a substantive contribution showing that Roth-Erev reinforcement learning explains behavior in the imperfect information game better than competing models, including fictitious play and the Bayes-Nash equilibrium prediction. Observing such a strong fit in an experiment as simple as the Stop-Go game indicates that this model can (and should) be tested in new domains in order to assess its effectiveness across a number of settings. This finding is emblematic of the fact that newer, better, more biologically and socially plausible models of behavior will emerge as technology and the scientific understanding of human behavior co-evolve; Roth-Erev reinforcement learning is one such model that has proven effective in describing behavior.

    8 Discussion

The goal of this paper was to investigate the notion of information gain in experimentation to identify behavioral types in situations involving strategic interactions. Specifically, we sought to improve upon a classic optimal experimental design approach where the optimal experiment is defined as the experiment that maximizes the KL divergence between the likelihoods of multiple competing models under a given experimental design. In addition to illustrating and improving upon existing methods, we asked whether domain experts would predict optimal experimental designs that

corresponded to those generated algorithmically. Lastly, we illustrated the potential of optimizing experiments for information gain by adding an additional model to the protocol. This highlights the adaptive nature of optimizing experimental design for information gain: it allows for the evaluation of new models of behavior as they are invented.

By implementing a modified version of the adaptive search GPUCB-PE algorithm, we were able to actively construct an information landscape, a surface that represents how much information gain is associated with each possible combination of experimental design parameters. In doing so, we also eliminated the computationally costly task of computing model likelihoods for all possibly-observable datasets during the optimal design process. This was achieved by sampling model parameters and simulating likely datasets. While this process produced noisier estimates of the information gain associated with each experimental design, the noise quickly disappears through repeated (GPUCB-PE) search on the information surface. In practical terms, this reduced the task of finding the optimal experiment from 72 hours of computation on a supercomputing cluster to approximately an hour on a standard laptop, which suggests that this procedure could be run in real time.

The information landscape we constructed using our Parameter-Sampled GPUCB-PE procedure informed us about which experimental designs to test in our replication of the Stop-Go game [23]. However, we supplemented this list of experimental designs with insight from another source: domain experts. We conducted a survey, reaching out to over 800 experts in behavioral economics and digital experimentation, to assess whether their domain expertise would lead to accurate predictions about the most informative experiment to run. The expert prediction experiment showed that there was little consensus among the experts about which experimental design was optimal. Interestingly, the experts generally lacked confidence in their predictions of the optimal experimental design, and this also appeared to make them inclined to advocate for an experiment with higher payouts to the participants. In reality, however, the optimal experiment was predicted to be one with the smallest payout.

Ultimately, we ran five experiments in an attempt to distinguish between competing models of behavior. We began our optimal design procedure with three competing models; after we collected data, we found that one model was well-equipped to explain data from several experiments, but it was not the most likely model across every experiment we ran. For this reason, we chose to introduce a fourth model in our model comparison, a simple reinforcement learning model. Model 4 ended up being the model that was best able to explain the observed data across each of the five experiments. This finding is key, and it demonstrates the potential benefits of an optimal design procedure like the one described here. It enables researchers to quickly introduce new competing models that can range from subtle tweaks of an existing model to an entirely new model. All that is needed is to add the new model to the model comparison procedure, find the experimental design that is expected to be most informative, and evaluate it on existing data or collect a new batch of data to test the models (if the optimal experiment differs from the data already collected). Given that it is unlikely that researchers will discover one best model for describing human behavior, there is a constant drive towards improving upon existing models, exploring for novel ones, and actively comparing them via the scientific process. Our fourth model outperforms the initial three that were tested, illustrating this point. This also contributes to the growing body of work suggesting that reinforcement learning is good at describing human behavior [45, 46, 47, 48] by adding experimental results for repeated imperfect information games.

    The benefits of rapidly assigning expected information gain to different experimental designs

have a very wide reach beyond the current applications of Parameter-Sampled GPUCB-PE. While the current approach is meant for distinguishing between generative models of human behavior and picking the right experiment to separate the models, it can be adapted to accommodate different goals in experimentation, like optimizing the number of participants, optimizing for effect size, or optimizing for monetary cost. Looking ahead, this is a step in the direction of optimizing social network experiments where the particular network topology is the experimental design to test. This is a notoriously challenging computational task [49, 50] that could be eased using the methods described here.

Finally, our procedure can also be used by firms, especially digital companies like Google, Facebook, Twitter, and Microsoft that run thousands of experiments every month [51], to identify the most advantageous combinations of parameters for products, user interfaces, and marketing strategies [52, 8, 53]. Firms that successfully master experiment optimization can enormously reduce the costs of exploring less lucrative regions of the parameter space and achieve a durable competitive advantage [54, 55].

    9 Conclusion

Researchers develop models exemplifying behavioral types in order to understand the mechanisms and decision processes in situations of strategic interaction and information asymmetry. As such, the role of experimentation is paramount for model selection, that is, for distinguishing among models and evaluating how well they describe human behavior. Experimenters can now use the algorithmic tools described in this paper to rapidly determine the exact experimental design that would maximally distinguish between the competing models of behavior. As a proof of principle, we used these tools in a game of imperfect information under several different parameterizations.

In the future, it is not difficult to imagine the successful implementation of an automated online experimentation platform built using tools such as those described in this work [56]. Such an experimental platform would assign the parameters of an experiment, run an optimal experiment, update the model predictions, and evaluate the differences between the competing behavioral models. This could be used to search high-dimensional information landscapes, comparing complex models with many parameters; such complexity would likely require informative priors on the different model parameters in order for the Parameter-Sampled GPUCB-PE search to work best, and future work will explore such approaches. In a way, this places more emphasis on the scientist’s role as a theoretician, integrating results from optimally designed experiments into new models that capture the commonalities and peculiarities of human behavior. Together, they form a human-AI team that can help advance social science more rapidly.

    Acknowledgements

The authors acknowledge Mahmoud El-Gamal for helpful correspondence and Stephanie W. Wang for useful comments on the design and implementation. This work was supported in part by the Office of Naval Research (N00014-16-1-3005 and N00014-17-1-2542) and the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.

    Conflict of interest

    The authors declare that they have no conflict of interest.

Author contributions

All authors contributed to the study conception and design. All authors contributed to analyses and preparation of the manuscript. S.B. ran the online experiments, and all authors contributed to the expert surveying.

    Code and data availability

The optimal experimental design code and the data for replicating the results in this study can be found at http://github.com/shakty/optimal-design.

    References

[1] Ronald A. Fisher. “The design of experiments”. In: American Mathematical Monthly 43.3 (1936), p. 180. doi: 10.2307/2300364.

[2] Theodore P. Hill. “A statistical derivation of the significant-digit law”. In: Statistical Science 10.4 (1995), pp. 354–363. doi: 10.2307/2246134.

[3] Timothy C. Salmon. “An evaluation of econometric models of adaptive learning”. In: Econometrica 69.6 (2001), pp. 1597–1628. doi: 10.1111/1468-0262.00258.

[4] Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte. “p-Hacking and false discovery in A/B testing”. In: Available at SSRN (2018). doi: 10.2139/ssrn.3204791.

[5] Andrey Rzhetsky, Jacob G. Foster, Ian T. Foster, and James A. Evans. “Choosing experiments to accelerate collective discovery”. In: Proceedings of the National Academy of Sciences (2015), pp. 1–6. doi: 10.1073/pnas.1509757112.

[6] Dean Eckles and Maurits C. Kaptein. “Thompson sampling with the online bootstrap”. In: arXiv (2014), pp. 1–13. url: arxiv.org/abs/1410.4009.

[7] Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. “Constrained Bayesian optimization with noisy experiments”. In: arXiv (2017), pp. 1–20. url: arxiv.org/abs/1706.07094.

[8] Eric M. Schwartz, Eric T. Bradlow, and Peter S. Fader. “Customer acquisition via display advertising using multi-armed bandit experiments”. In: Marketing Science 36.4 (2017), pp. 500–522. doi: 10.1287/mksc.2016.1023.

[9] Sharon Zhou, Melissa Valentine, and Michael S. Bernstein. “In search of the dream team: Temporally constrained multi-armed bandits for identifying effective team structures”. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 2018, pp. 1–13. doi: 10.1145/3173574.3173682.

[10] Stephanie W. Wang, Michelle Filiba, and Colin F. Camerer. “Dynamically optimized sequential experimentation (DOSE) for estimating economic preference parameters”. In: Working paper (2010), pp. 1–41. url: https://bit.ly/39iPvkw.

[11] Taisuke Imai and Colin F. Camerer. “Estimating time preferences from budget set choices using optimal adaptive design”. In: Working paper (2018). url: taisukeimai.com/files/adaptive_ctb.pdf.

[12] Shakoor Pooseh, Nadine Bernhardt, Alvaro Guevara, Quentin J. M. Huys, and Michael N. Smolka. “Value-based decision-making battery: A Bayesian adaptive approach to assess impulsive and risky behavior”. In: Behavior Research Methods 50.1 (2018), pp. 236–249. doi: 10.3758/s13428-017-0866-x.

[13] Jonathan Chapman, Erik Snowberg, Stephanie Wang, and Colin Camerer. “Loss attitudes in the US population: Evidence from dynamically optimized sequential experimentation (DOSE)”. In: National Bureau of Economic Research (2018), pp. 1–55. doi: 10.3386/w25072.

[14] Paul A. David. “Clio and the economics of QWERTY”. In: American Economic Review 75.2 (1985), pp. 332–337. url: jstor.org/stable/1805621.

[15] Michael L. Katz and Carl Shapiro. “Network externalities, competition, and compatibility”. In: American Economic Review 75.3 (1985), pp. 424–440. url: jstor.org/stable/1814809.

[16] Yann Bramoullé, Habiba Djebbari, and Bernard Fortin. “Peer Effects in Networks: A Survey”. In: CEPR Discussion Paper No. DP14260 (2020). url: ftp.iza.org/dp12947.pdf.

[17] Colin F. Camerer. Behavioral game theory: Experiments in strategic interaction. Princeton University Press, 2011. isbn: 9780691090399. url: psycnet.apa.org/record/2003-06054-000.

[18] Drew Fudenberg and Jean Tirole. Game theory. MIT Press, 1991. url: mitpress.mit.edu/books/game-theory.

[19] Edward M. Tauber. “Why do people shop?” In: Journal of Marketing 36.4 (1972), pp. 46–49. doi: 10.2307/1250426.

[20] Duncan Sheppard Gilchrist and Emily Glassberg Sands. “Something to talk about: Social spillovers in movie consumption”. In: Journal of Political Economy 124.5 (2016), pp. 1339–1382. doi: 10.1086/688177.

[21] John C. Harsanyi. “Games with incomplete information played by “Bayesian” players, Part I. The basic model”. In: Management Science 14.3 (1967), pp. 159–182. doi: 10.1287/mnsc.1040.0270.

[22] David P. McIntyre and Asda Chintakananda. “Competing in network markets: Can the winner take all?” In: Business Horizons 57.1 (2014), pp. 117–125. doi: 10.1016/j.bushor.2013.09.005.

[23] Mahmoud A. El-Gamal and Thomas R. Palfrey. “Economical experiments: Bayesian efficient experimental design”. In: International Journal of Game Theory 25 (1996), pp. 495–517. doi: 10.1007/BF01803953.

[24] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. “Parallel Gaussian process optimization with upper confidence bound and pure exploration”. In: Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 225–240. isbn: 978-3-642-40988-2. doi: 10.1007/978-3-642-40988-2_15.

[25] Solomon Kullback and Richard A. Leibler. “On information and sufficiency”. In: The Annals of Mathematical Statistics 22.1 (1951), pp. 79–86. doi: 10.1214/aoms/1177729694.

[26] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT Press, 2005. isbn: 026218253X. url: gaussianprocess.org/gpml/.

[27] Jochen Görtler, Rebecca Kehlbeck, and Oliver Deussen. “A visual exploration of Gaussian processes”. In: Distill (2019). doi: 10.23915/distill.00017.

[28] Themistoklis S. Stefanakis, Emile Contal, Nicolas Vayatis, Frédéric Dias, and Costas E. Synolakis. “Can small islands protect nearby coasts from tsunamis? An active experimental design approach”. In: Proceedings of the Royal Society A 470.2172 (2014), pp. 1–20. doi: 10.1098/rspa.2014.0575.

[29] Ilya M. Sobol. “On quasi-Monte Carlo integrations”. In: Mathematics and Computers in Simulation 47.2 (1998), pp. 103–112. doi: 10.1016/S0378-4754(98)00096-2.

[30] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. “Gaussian process optimization in the bandit setting: No regret and experimental design”. In: Proceedings of the 27th International Conference on Machine Learning. ICML ’10. 2010, pp. 1015–1022. doi: 10.5555/3104322.3104451.

[31] Nando De Freitas, Alex J. Smola, and Masrour Zoghi. “Exponential regret bounds for Gaussian process bandits with deterministic observations”. In: Proceedings of the 29th International Conference on Machine Learning. ICML ’12. 2012, pp. 955–962. isbn: 9781450312851. doi: 10.5555/3042573.3042697.

[32] Stefano DellaVigna and Devin Pope. “What motivates effort? Evidence and expert forecasts”. In: Review of Economic Studies 85.2 (2017), pp. 1029–1069. doi: 10.1093/restud/rdx033.

[33] Marc Knez and Colin F. Camerer. “Creating expectational assets in the laboratory: Coordination in ‘weakest-link’ games”. In: Strategic Management Journal 15.1 S (1994), pp. 101–119. doi: 10.1002/smj.4250150908.

[34] Ralph Hertwig and A. Ortmann. “Experimental practices in economics: A methodological challenge for psychologists?” In: Behavioral and Brain Sciences 24.3 (2001), pp. 383–403. doi: 10.1037/e683322011-032.

[35] Steven J. Kachelmeier and Kristy L. Towry. “The limitations of experimental design: A case study involving monetary incentive effects in laboratory markets”. In: Experimental Economics 8.1 (2005), pp. 21–33. doi: 10.1007/s10683-005-0435-5.

[36] John J. Horton, David G. Rand, and Richard J. Zeckhauser. “The online laboratory: Conducting experiments in a real labor market”. In: Experimental Economics 14.3 (2011), pp. 399–425. doi: 10.1007/s10683-011-9273-9.

[37] Winter Mason and Siddharth Suri. “Conducting behavioral research on Amazon’s Mechanical Turk”. In: Behavior Research Methods 44.1 (2012), pp. 1–23. doi: 10.3758/s13428-011-0124-6.

[38] Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis. “Running experiments on Amazon Mechanical Turk”. In: Judgment and Decision Making 5.5 (2010), pp. 411–419. url: psycnet.apa.org/record/2010-18204-008.

[39] Stefano Balietti. “nodeGame: Real-time, synchronous, online experiments in the browser”. In: Behavior Research Methods (2017), pp. 1–31. doi: 10.3758/s13428-016-0824-z.

[40] Ido Erev and Alvin E. Roth. “Predicting how people play games: Reinforcement learning in games with unique strategy equilibrium”. In: American Economic Review 88.4 (1998), pp. 848–881. url: jstor.org/stable/117009.

[41] Michael Foley, Patrick Forber, Rory Smead, and Christoph Riedl. “Conflict and convention in dynamic networks”. In: Journal of the Royal Society Interface 15.140 (2018), p. 20170835. doi: 10.1098/rsif.2017.0835.

[42] Ido Erev and Alvin E. Roth. “Maximization, learning, and economic behavior”. In: Proceedings of the National Academy of Sciences 111 (2014), pp. 10818–10825. doi: 10.1073/pnas.1402846111.

[43] Colin F. Camerer and Teck-Hua Ho. “Experience-weighted attraction learning in normal form games”. In: Econometrica 67.4 (1999), pp. 827–874. doi: 10.1111/1468-0262.00054.

[44] Teck-Hua Ho, Xin Wang, and Colin F. Camerer. “Individual differences in EWA learning with partial payoff information”. In: The Economic Journal 118.525 (2008), pp. 37–59. doi: 10.1111/j.1468-0297.2007.02103.x.

[45] Nick Feltovich. “Reinforcement-based versus belief-based learning models in experimental asymmetric-information games”. In: Econometrica 68.3 (2000), pp. 605–641. doi: 10.1111/1468-0262.00125.

[46] John Gale, Kenneth G. Binmore, and Larry Samuelson. “Learning to be imperfect: The ultimatum game”. In: Games and Economic Behavior 8.1 (1995), pp. 56–90. doi: 10.1016/S0899-8256(05)80017-X.

[47] Rajiv Sarin and Farshid Vahid. “Predicting how people play games: A simple dynamic model of choice”. In: Games and Economic Behavior 34.1 (2001), pp. 104–122. doi: 10.1006/game.1999.0783.

[48] Chaim Fershtman and Ariel Pakes. “Dynamic games with asymmetric information: A framework for empirical work”. In: The Quarterly Journal of Economics 127.4 (2012), pp. 1611–1661. doi: 10.1093/qje/qjs025.

[49] Tuan Q. Phan and Edoardo M. Airoldi. “A natural experiment of social network formation and dynamics”. In: Proceedings of the National Academy of Sciences 112.21 (2015), pp. 6595–6600. doi: 10.1073/pnas.1404770112.

[50] Ben M. Parker, Steven G. Gilmour, and John Schormans. “Optimal design of experiments on connected units with application to social networks”. In: Journal of the Royal Statistical Society C 66.3 (2017), pp. 455–480. doi: 10.1111/rssc.12170.

[51] Matthew Goldman and Justin Rao. “Experiments as instruments: Heterogeneous position effects in sponsored search auctions”. In: EAI Endorsed Transactions on Serious Games 3.11 (2016). doi: 10.4108/eai.8-8-2015.2261043.

[52] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. “Controlled experiments on the web: Survey and practical guide”. In: Data Mining and Knowledge Discovery 18.1 (2009), pp. 140–181. doi: 10.1007/s10618-008-0114-1.

[53] Ron Berman. “Beyond the last touch: Attribution in online advertising”. In: Marketing Science 37.5 (2018), pp. 771–792. doi: 10.1287/mksc.2018.1104.

[54] Ron Kohavi and Stefan Thomke. “The surprising power of online experiments”. In: Harvard Business Review 95.5 (2017), pp. 2–9. url: hbr.org/2017/09/the-surprising-power-of-online-experiments.

[55] Eduardo M. Azevedo, Alex Deng, Jose Luis Montiel Olea, Justin Rao, and E. Glen Weyl. “A/B Testing with Fat Tails”. In: Journal of Political Economy (2019). doi: 10.1086/710607.

[56] Eytan Bakshy, Lili Dworkin, Brian Karrer, Konstantin Kashin, Benjamin Letham, Ashwin Murthy, and Shaun Singh. “AE: A domain-agnostic platform for adaptive experimentation”. In: Conference on Neural Information Processing Systems (2018), pp. 1–8. url: eytan.github.io/papers/ae_workshop.pdf.

[57] B. MacDonald, H. Chipman, C. Campbell, and P. Ranjan. GPfit: Gaussian process modeling. R package version 1.0-8. 2019. url: cran.r-project.org/web/packages/GPfit/index.html.

[58] Antonio A. Arechar, Simon Gächter, and Lucas Molleman. “Conducting interactive experiments online”. In: Experimental Economics 21.1 (2018), pp. 99–131. doi: 10.1007/s10683-017-9527-2.

[59] Neil Stewart, Jesse Chandler, and Gabriele Paolacci. “Crowdsourcing samples in cognitive science”. In: Trends in Cognitive Sciences 21.10 (2017), pp. 736–748. doi: 10.1016/j.tics.2017.06.007.

A Appendix

    A.1 GPUCB-PE Algorithm

The Gaussian Process Upper Confidence Bound-Pure Exploration (GPUCB-PE) search algorithm is described herein. In our case, the algorithm iteratively searches different points on an information landscape where each point is a coordinate, a pair of parameter values (A, π), that together define a single experimental design, θ. The value associated with any given point in this landscape is the information gain, I(x, θ), defined in Equation 2 in the main text.

[Figure A.1 graphic: a one-dimensional search example with x-axis “Locations to search” and y-axis “Reward”. Legend: True Function; Estimated Function (ŷ); Sampled Points (from start to t); max(ŷ + UCB) (searched at t+1); max(σ̂) (searched at t+2); Confidence Bound. In-figure summary of the GPUCB-PE algorithm: (1) sample an initial number of points; (2) using a GP, estimate ŷ and σ̂; (3) UCB: search where ŷ + UCB is highest; (4) PE: search k times where σ̂ is highest, within the “relevant region” (Stefanakis et al., 2014); (5) repeat from Step 2 until the stopping rule, based on a modified Spearman rank coefficient (Stefanakis et al., 2014), has been satisfied.]

Figure A.1: Schematic representation of the GPUCB-PE algorithm. Gaussian Process regression has recently emerged as an effective tool for searching complex landscapes. As such, there are numerous software packages that simplify this task. In the current work, we rely on the GPfit package in the R statistical computing library [57], making the computations behind Step 2 (using GP, estimate ŷ and σ̂) straightforward.

1. Sample an initial number of points from the landscape and calculate the associated information gain, I(x, θ), of each.

2. With these initial points, use Gaussian Processes to construct an estimate, ŷ, for the value of each other point in the landscape, along with the variance, σ̂, associated with each ŷ estimate.

3. GPUCB-PE is associated with two types of search: a “UCB” search and a “PE” search. Each iteration of the algorithm performs one search at the coordinate associated with the highest ŷ + UCB value, where UCB is the upper confidence bound defined using the σ̂ from the Gaussian Process in Step 2.

4. The algorithm then selects k “pure exploration” (PE) points to search (for more complex landscapes a higher k may be needed, providing more exploration and making the algorithm

more likely to converge to the global maximum). Each PE search selects the point in the landscape with the highest σ̂ value within a “relevant region”, so as to reduce the largest remaining uncertainty in the estimate of the landscape (see the note on the relevant region below).

5. The algorithm then calculates the I(x, θ) of each newly sampled point and runs another Gaussian Process model, producing new estimates for ŷ and σ̂ that account for the addition of the new points.

6. The algorithm continues from Step 3 unless the stopping rule is satisfied. This stopping rule is based on a modified Spearman rank correlation coefficient, comparing the relative values of each of the points in the previous estimate of the landscape to the ranking of the same points in the current estimate of the landscape. This measure quantifies the dissimilarity between subsequent estimates of the landscape. The GPUCB-PE algorithm will stop if that measure is less than the stopping parameter, ρ, for L iterations in a row. These two values are, in practice, set to L = 4 and ρ = 0.001. This stopping rule ensures that the GPUCB-PE algorithm will not continue to search uninformative points on the landscape, which is usually associated with its convergence on the global maximum in the landscape. A minimal code sketch of this loop is given below.

See Figure A.1.

Note: the “relevant region” is introduced so as to avoid searching points that have high uncertainty, σ̂, but low value, ŷ. The region includes every point whose ŷ + UCB value is higher than that of the point with the largest ŷ − LCB value (the largest lower confidence bound). See [24, 28] for a more detailed description of the PE step in GPUCB-PE.
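The sketch below is a minimal, self-contained rendering of that loop. It is an illustration, not our implementation: it swaps in scikit-learn’s Gaussian process regressor for the GPfit R package used in the paper, replaces the information gain I(x, θ) with a noisy toy objective, and substitutes a fixed iteration budget and a crude exclusion rule for the relevant-region restriction and the Spearman-rank stopping rule; all names and parameter values are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def toy_information_gain(x, rng):
    """Stand-in for the (noisy) information gain I(x, theta) at design x = (A, pi)."""
    A, pi = x
    return -((A - 4.0) ** 2 + 10.0 * (pi - 0.4) ** 2) + 0.1 * rng.standard_normal()

# Candidate experimental designs: a grid over (A, pi).
A_grid = np.linspace(2.0, 8.0, 25)
pi_grid = np.linspace(0.1, 0.9, 17)
candidates = np.array([[a, p] for a in A_grid for p in pi_grid])

rng = np.random.default_rng(1)
init = rng.choice(len(candidates), size=5, replace=False)            # Step 1
X = candidates[init]
y = np.array([toy_information_gain(x, rng) for x in X])

beta, k_pe, n_iter = 2.0, 2, 20
for _ in range(n_iter):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  alpha=1e-6, normalize_y=True).fit(X, y)   # Step 2
    y_hat, sigma_hat = gp.predict(candidates, return_std=True)

    chosen = [int(np.argmax(y_hat + beta * sigma_hat))]               # Step 3: UCB point
    sigma_pe = sigma_hat.copy()
    for _ in range(k_pe):                                             # Step 4: PE points
        sigma_pe[chosen] = 0.0   # crude stand-in for conditioning on already-chosen points
        chosen.append(int(np.argmax(sigma_pe)))

    new_X = candidates[chosen]                                        # Step 5: evaluate
    new_y = np.array([toy_information_gain(x, rng) for x in new_X])
    X, y = np.vstack([X, new_X]), np.concatenate([y, new_y])

print("best design found (A, pi):", X[np.argmax(y)])
```

In the full procedure, each call to the objective would itself be the Parameter-Sampled estimate of information gain described in the main text.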

A.1.1 Parameter-Sampled GPUCB-PE Regret

Figure A.2: Relationship between the number of sampled datasets per search in Parameter-Sampled GPUCB-PE (1000, 5000, or 10,000), the number of iterations needed to converge, and the regret (each simulated 200 times). Sampling more datasets tends to require more searches in order for the algorithm to converge, while also further minimizing regret. Note: the differences in regret in this plot are due to the fact that sampling fewer datasets produces a relatively higher mean squared error, but crucially, each of these simulations converged to the true global maximum.