Economic Theory and Statistical Learning

Citation: Liang, Annie. 2016. Economic Theory and Statistical Learning. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Permanent link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:33493561

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA


Economic Theory and Statistical Learning

A dissertation presented

by

Annie Liang

to

The Department of Economics

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Economics

Harvard University

Cambridge, Massachusetts

May 2016


© 2016 Annie Liang

All rights reserved.


Dissertation Advisors: Professor Drew Fudenberg, Professor Jerry Green

Author: Annie Liang

Economic Theory and Statistical Learning

Abstract

This dissertation presents three independent essays in microeconomic theory. Chapter 1 suggests an alternative to the common prior assumption, in which agents form beliefs by learning from data, possibly interpreting the data in different ways. In the limit as agents observe increasing quantities of data, the model returns strict solutions of a limiting complete information game, but predictions may diverge substantially for small quantities of data. Chapter 2 (with Jon Kleinberg and Sendhil Mullainathan) proposes use of machine learning algorithms to construct benchmarks for "achievable" predictive accuracy. The paper illustrates this approach for the problem of predicting human-generated random sequences. We find that leading models explain approximately 10-15% of predictable variation in the problem. Chapter 3 considers the problem of how to interpret inconsistent choice data, when the observed departures from the standard model (perfect maximization of a single preference) may emerge either from context-dependencies in preference or from stochastic choice error. I show that if preferences are "simple" in the sense that they consist only of a small number of context-dependencies, then the analyst can use a proposed optimization problem to recover the true number of underlying context-dependent preferences.


Contents

Abstract
Acknowledgments

Introduction

1 Games of Incomplete Information Played by Statisticians
  1.1 Introduction
  1.2 Example
  1.3 Preliminaries and Notation
    1.3.1 The game
    1.3.2 Beliefs
    1.3.3 Solution concepts
  1.4 Learning from Data
    1.4.1 When do agents commonly learn?
  1.5 Robustness to Inference
    1.5.1 Concepts
    1.5.2 Bayesian Nash Equilibrium
    1.5.3 Rationalizable Actions
  1.6 How Much Data do Agents Need?
    1.6.1 Bayesian Nash Equilibrium
    1.6.2 Rationalizable Actions
    1.6.3 Diversity across Inference Rules in M
  1.7 Extensions
    1.7.1 Misspecification
    1.7.2 Private Data
    1.7.3 Limit Uncertainty
  1.8 Related Literature
    1.8.1 Robustness of equilibrium and equilibrium refinements
    1.8.2 Role of higher-order beliefs
    1.8.3 Agents who learn from data
    1.8.4 Model uncertainty
    1.8.5 Epistemic game theory
  1.9 Discussion

2 The Theory is Predictive, but is it Complete? An Application to Human Perception of Randomness
  2.1 Introduction
  2.2 Primary setting: human generation of coin flips
    2.2.1 Description of data
    2.2.2 Theories of misperception
    2.2.3 Prediction tasks
    2.2.4 Establishing a benchmark
    2.2.5 Other possible benchmarks
  2.3 Transfer across Domains
    2.3.1 Prediction of New Alphabets
    2.3.2 Prediction of Subsequent Flips
  2.4 Discussion
    2.4.1 Guarantees on the benchmark
    2.4.2 Covariates
    2.4.3 Transfer learning
  2.5 Relationship to Literature
  2.6 Conclusion

3 Interpretation of Inconsistent Choice Data: How Many Context-Dependent Preferences are There?
  3.1 Introduction
  3.2 Notation
  3.3 Example
  3.4 Approach
  3.5 Recovery Results
    3.5.1 Class of choice models
    3.5.2 Can we recover the number of orderings?
    3.5.3 Can we recover more?
  3.6 Relationship to Literature

References

Appendix A Appendix to Chapter 1
  A.1 Notation and Preliminaries
  A.2 Preliminary Results
  A.3 Main Results
    A.3.1 Proof of Claim 1
    A.3.2 Proof of Proposition 1
    A.3.3 Proof of Claim 3
    A.3.4 Proof of Theorem 2
    A.3.5 Proof of Proposition 2
    A.3.6 Proof of Proposition 3
    A.3.7 Proof of Proposition 4
  A.4 An example illustrating the fragility of weak strict-rationalizability

Appendix B Appendix to Chapter 2
  B.1 Experiment Instructions
  B.2 Behavioral Prediction Rules

Appendix C Appendix to Chapter 3
  C.1 Proof of Theorem 1
    C.1.1 Preliminary Notation and Results
    C.1.2 Main Proof
  C.2 Corollary 1
  C.3 Corollary 2
  C.4 Proof of Proposition 1


List of Tables

2.1 The empirical probability of Heads, conditional on three fixed previous flips: (1) the actual proportion of generated Heads in our data, (2) the assessed probability of Heads on the next flip from Rapoport & Budescu (1997), as presented in Rabin and Vayanos (2010), (3) probabilities consistent with a Bernoulli(0.5) process.

2.2 Prediction errors achieved using Rabin (2002) and Rabin and Vayanos (2010) are improvements on the prediction error achieved by guessing at random. How do we assess the size of this improvement?

2.3 Comparison of prediction error achieved using behavioral models with prediction error achieved using table lookup. The behavioral models explain between 9% and 12% of the explainable variation in the data.

2.4 Comparison of prediction error achieved using behavioral models with prediction error achieved using table lookup.

2.5 We train table lookup and our two behavioral models on the original 8-length {H, T} data, and then use the estimated models to predict 8-length {r, 2} and {@, !} data. Reported prediction errors are tenfold cross-validated mean squared errors.

2.6 We train table lookup and our two behavioral models on the original 8-length {H, T} data, and then use the estimated models to predict the data in {D_{k:k+7}}_{k=2}^{8}. Reported prediction errors are tenfold cross-validated mean squared errors.


List of Figures

1.1 Two relevant attributes (r = 2). Yield is high under environmental conditions in [−c0, c0]², and low otherwise. Farmers do not know the high yield region (shaded).

1.2 Circles indicate low yields, and squares indicate high yields. The two rectangles identify partitionings (predict high yield if x is contained within the rectangle, and low yield otherwise) that exactly fit the data.

1.3 Every axis-aligned rectangle partitioning predicts high yield at the origin, but there exists a rotated rectangle partitioning that predicts low yield.

1.4 The map h takes first-order beliefs μ into expected payoff functions.

1.5 The set U^R_{a_i*} is partitioned such that every agent's set of rationalizable actions is constant across each element of the partition. There are three cases: (1) if u* is on the boundary of U^R_{a_i*} (e.g. u_1), then a_i* is not robust to inference; (2) if u* is in the interior of U^R_{a_i*}, and moreover in the interior of a partition element (u_2), then a_i* is certainly robust to inference; (3) if u* is in the interior of U^R_{a_i*}, but not in the interior of any partition element (u_3), then a_i* may not be robust to inference. See Appendix D for an example of the last case.

2.1 (a) Top row. Distribution of the number of heads in the realized string. Left: comparison of MTurk data with theoretical Bernoulli predictions. Right: comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions. (b) Bottom row. Distribution of the proportion of runs which are of length m. Left: comparison of MTurk data with theoretical Bernoulli predictions. Right: comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions.

3.1 The problem in (3.2) returns a solution with 2 orderings if and only if the line with normal vector (−1, −λ) supports F at (2, D_2).

3.2 Studying rationalizability of a dataset is equivalent to studying colorability of a graph in which nodes represent observations and edges represent inconsistencies.


C.1 Any choice of λ for which (−1, −λ) is a subgradient of f_H at (K, D_{K,H}) will recover K. With high probability, the set of vectors {(−1, −1/(p+δ)) : δ ∈ (0, d(1−p)K/M − 2p − β)} is a subset of the subdifferential of f_H at (K, D_{K,H}).


Acknowledgments

I am deeply grateful to my advisors, Drew Fudenberg, Jerry Green, David Laibson, and Sendhil Mullainathan. Sendhil Mullainathan's support was essential in my path to and through graduate school, and his unique insight is a singular inspiration. David Laibson's advising, along with his thoughtful example, has been the single most important input into how I think about the purpose of my work, and what I hope to contribute. Jerry Green guided me as a first-year graduate student, helping me to discover the joys and challenges of research, and his generous mentorship has been invaluable since then. Drew Fudenberg's dedicated training, mentorship, and guidance have been crucial to my development as a researcher, and the development of my research identity.

In addition to my advisors, I am thankful also to Andrei Shleifer, Tomasz Strzalecki, and Eric Maskin for many important conversations throughout graduate school.

Finally, I am above all grateful to my parents Bo Liu and Zhi-Pei Liang, and to my brother, Danny Liang. I love them dearly.


To my parents and brother


Introduction

This dissertation comprises three independent chapters that study topics at the intersection of economic theory and statistical learning.

My job market paper ("Games of Incomplete Information Played by Statisticians") develops a new framework for modeling beliefs in incomplete information games. The standard approach assumes that agents share a common prior distribution over uncertainty, and thus have a common (ex-ante) model of the world. It is known that this assumption has several strong implications that conflict with empirical evidence—in particular, it precludes public and persistent disagreement. However, the question of what kind of heterogeneity in beliefs to allow, and how to do so in a structured way, remains open. My paper proposes a reformulation of incomplete information games, in which agents form beliefs by learning from data. The key feature of this approach is that, in the absence of a "privileged" or "correct" model for interpreting data, the kind of disagreement that the standard model precludes arises naturally as a consequence of statistical ambiguity. I use this framework to study the robustness of predictions (under the standard model) to the introduction of statistical uncertainty, and additionally propose a criterion for equilibrium selection that removes solutions that are only supported by unreasonably large quantities of data.

The second segment ("The Theory is Predictive, but is it Complete? An Application to Human Perception of Randomness," with Jon Kleinberg and Sendhil Mullainathan) develops a way to measure the "completeness" of an economic theory. Current methods for testing theories of economic behavior focus on whether the predictions of the theory match what we see in the available data. But we also care about the extent to which the predictable variation in data is captured by the theory. This second property is difficult to measure, because in general we do not know how much "predictable variation" there is in the problem. We propose the use of machine learning algorithms to construct a benchmark for the "achievable level of prediction," and illustrate this approach on the problem of predicting human generation of random sequences. We find that leading behavioral models explain approximately 10-15% of the variation in the data explained using an atheoretical machine learning algorithm. This suggests that there is remaining predictable structure in the problem to be uncovered.

The final segment of my work involves the discovery of context-dependencies in preference. In "Interpretation of Inconsistent Choice: How Many Context-Dependent Preferences are There?", I consider the problem of how to interpret inconsistent choice data, when the observed departures from the standard model (perfect maximization of a single preference) may emerge either from context-dependencies in preference or from stochastic choice error. I show that if preferences are "simple" in the sense that they consist only of a small number of context-dependencies, then the analyst can use a proposed optimization problem to recover the true number of underlying context-dependent preferences.


Chapter 1

Games of Incomplete Information Played by Statisticians

1.1 Introduction

In games with a payoff-relevant parameter, players' beliefs about this parameter, as well as their beliefs about opponent beliefs about this parameter, are important for predictions of play. The standard approach to restricting the space of beliefs assumes that players share a common prior distribution.[1] This assumption is known to have strong implications, including that beliefs that are commonly known must be identical (Aumann, 1976), and repeated communication of beliefs will eventually lead to agreement (Geanakoplos and Polemarchakis, 1982). These properties conflict not only with considerable empirical evidence of public and persistent disagreement,[2] but also with the more basic, day-to-day, experience that

[1] The related, stronger, notion of rational expectations assumes moreover that this common prior distribution is in fact the "true" distribution shared by the modeler.

[2] In financial markets, agents publicly disagree in their interpretations of earnings announcements (Kandel and Pearson, 1995), valuations of financial assets (Carlin et al., 2013), forecasts for inflation


people sometimes come to different conclusions given the same information.

As a consequence, the following questions arise: When is disagreement a feature of agents' beliefs, and how can this disagreement be predicted from the primitives of the economic environment? Can we relax the common prior assumption to accommodate (commonly known) disagreement in a structured way? Finally, when are strategic predictions robust to relaxations of the common prior assumption?

Towards the first questions of modeling and predicting disagreement, I propose a reformulation of incomplete information in which agents form beliefs by learning from data. I take data to be a random sequence of observations, drawn i.i.d. from an exogenous distribution P, and define an inference rule to be any map from possible datasets into distributions over the parameter space. (For example, we can think of data as historical stock prices, and inference rules as maps from possible time-series of stock returns to distributions over returns next period.)

This perspective on beliefs provides a way to rationalize disagreement—in the absence of a "privileged" or "correct" inference rule, different interpretations of common data are not only possible, but even natural.[3] The key restriction I impose to structure this approach is that while agents may learn from data using different inference rules, they have common certainty in the predictions of a family of plausible inference rules.[4] This assumption is referred to as common inference.

In the main part of the paper, I additionally assume a condition on the family of

(Mankiw et al., 2004), forecasts for stock movements (Yu, 2011), and forecasts for mortgage loan prepayment speeds (Carlin et al., 2014). Agents publicly disagree also in matters of politics (Wiegel, 2009) and climate change (Marlon et al., 2013).

[3] Indeed, this perspective has been taken in work by Al-Najjar (2009), Gilboa et al. (2013), and Acemoglu et al. (2015), among others, in various nonstrategic settings (see Section 9.3 for an extended review). I embed these ideas into an incomplete information game, and study their implications for strategic behavior.

[4] Reflecting, for example, common cultural experiences or industry-specific norms.


inference rules (uniform consistency[5]) that implies that agents commonly learn the true parameter (see Proposition 1).[6] In this framework, complete information is interpreted as a reduced form for agents having beliefs coordinated by an infinite quantity of data.[7]

Towards the second question of robustness to the common prior assumption, I propose a new robustness criterion for strategic predictions based on the quantity of data that agents need to see. I define a sequence of incomplete information games, called inference games, which are indexed by a quantity of (public) observations n < ∞. In each of these games, agents observe n random observations, and form beliefs satisfying common inference. As the quantity of data n tends to infinity, this sequence of games (almost surely) converges to the game in which agents have common certainty of the true parameter value. But for any n < ∞, agents have different beliefs.

The main part of the paper (Sections 5 and 6) asks: Which solutions of the limit complete information game persist (with high probability) in these finite-data inference games? The key object of study is p_n(a), the probability that an action profile a is a solution given n observations. Section 5 characterizes which solutions have the property that p_n(a) → 1 as n tends to infinity; these solutions are said to be robust to inference. I find that Nash equilibria are robust to inference if and

[5] The property of uniform consistency is satisfied by many families of inference rules, including any finite inference rule class, as well as certain classes of kernel density estimators with variable bandwidths, and certain classes of Bayes estimators with heterogeneous priors.

[6] This assumes implicitly that the unknown parameter can be identified in the data. In the proposed framework, disagreement may persist even given infinite quantities of data if the parameter is not identified.

[7] Recent papers have argued that agreement need not occur even in an infinite data limit. For example, Acemoglu et al. (2015) show that asymptotic beliefs need not agree when individuals are uncertain about signal distributions. I assume agreement given infinite data in the main part of the paper to emphasize the question of when (sufficient) agreement occurs given finite data. In Section 7.1, I show that this is a stronger assumption than is necessary for the main results.


only if they are strict (Theorem 1), and that the robustness of rationalizable actions can be characterized using procedures of iterated elimination of strategies that are never a strict best reply (Theorem 2).

In practice, agents only observe restricted amounts of data. Thus, strategic behavior in the limit (as the quantity of data grows arbitrarily large) may not be the most appropriate criterion for predictions in real economic environments. I suggest next that we can provide a measure for how robust a solution is by looking at how much data is required to support the solution. In Section 6, I provide lower bounds on p_n(a) for Nash equilibria (Proposition 2) and rationalizable actions (Proposition 3). For both solution concepts, the quantity of data required depends on several features:

features:

First, it depends on a cardinal measure of strictness of the solution. Say that an action profile is a δ-strict NE if each agent's prescribed action is at least δ better than his next best action; and say that an action profile is δ-strict rationalizable if it can be rationalized by a chain of best responses, in which each action yields at least δ over the next best alternative. This parameter δ turns out to determine how much estimation error the solution can withstand—the higher the degree of strictness (the larger the parameter δ), the less data agents need to see.
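To make the cardinal notion concrete, here is a minimal sketch of my own (not code from the dissertation; the payoff arrays and the function name are invented for illustration) that computes the largest δ for which a pure profile of a finite two-player game is a δ-strict Nash equilibrium.

```python
# Sketch (my own illustration): the strictness of a pure profile in a
# two-player game. payoffs[i][a1][a2] is player i's payoff at (a1, a2);
# for simplicity, both players are assumed to have the same action count.

def strictness(payoffs, profile):
    """Largest delta such that `profile` is a delta-strict NE: each player's
    action beats every unilateral deviation by at least delta. A negative
    value means the profile is not even a Nash equilibrium."""
    a1, a2 = profile
    gaps = []
    for b in range(len(payoffs[0])):
        if b != a1:  # player 1 deviates to b
            gaps.append(payoffs[0][a1][a2] - payoffs[0][b][a2])
        if b != a2:  # player 2 deviates to b
            gaps.append(payoffs[1][a1][a2] - payoffs[1][a1][b])
    return min(gaps)

# The adoption game of Section 1.2 with theta = 1 (actions: 0 = Adopt,
# 1 = Not Adopt). Both symmetric profiles are 1/2-strict equilibria.
p1 = [[1.0, 0.0], [0.5, 0.5]]   # row player's payoffs
p2 = [[1.0, 0.5], [0.0, 0.5]]   # column player's payoffs
assert strictness((p1, p2), (0, 0)) == 0.5
```

With θ = −1 instead, (Adopt, Adopt) fails this test, which is exactly why beliefs about θ matter for how much data a solution needs.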

Second, the quantity of data required depends on the "diversity" of the inference rules. When agents have common knowledge of a smaller set of inference rules, or when these inference rules diverge less in their predictions given common data, then fewer observations are needed to coordinate beliefs. Conversely, lack of common knowledge over what constitutes a reasonable interpretation of data serves to prolong disagreement. Thus, this criterion provides a formal sense in which the common prior assumption is less appropriate for predicting strategic interactions across cultures, nations, and industries.


Finally, the quantity of data required depends on the "complexity" of the learning problem. I do not provide a universal notion of complexity; instead, the relevant determinants are seen to vary with the choice of inference rules. For many classes of inference rules, an important determinant is dimensionality, and I provide several concrete examples to illustrate this. In these cases, predictions in the limit complete information game are less robust when payoffs are a function of a greater number of covariates.

These comparative statics are, in my view, a key advantage to modeling beliefs using the proposed framework. When agents learn from data, possibly using different inference rules, then channels for disagreement emerge that are complementary to (and distinct from) the traditional channel of differential information. In particular, the amount of common knowledge over how to interpret data, and the "dimensionality" or "complexity" of the unknown parameter, are both crucial to determining dispersion in beliefs. These sources for disagreement have potentially new implications for policy design and informational design: for example, summary statistics may facilitate coordination by reducing the complexity of information (and thus, coordinating beliefs).

The final sections proceed as follows: Section 7 examines several modeling choices made in the main text, and discusses the extent to which the main results rely on these choices. In particular, I look at relaxations of uniform consistency (Section 7.1), the introduction of private data (Section 7.2), and the introduction of limit uncertainty (Section 7.3).

Section 8 surveys the related literature. This paper builds a bridge between the body of work that studies the robustness of equilibrium and equilibrium refinements (Fudenberg et al., 1988; Carlsson and van Damme, 1993; Kajii and Morris, 1997; Weinstein and Yildiz, 2007), and the body of work that studies the asymptotic properties of learning from data (Cripps et al., 2008; Al-Najjar, 2009; Acemoglu et al., 2015).

Section 9 concludes.

1.2 Example

I begin by illustrating ideas with a simple coordination game, in which two farmers decide simultaneously whether or not to adopt a new agricultural technology—for example, a new pest-resistant grain. Continued production of the existing grain results in a payoff of 1/2. Transitioning alone to the new grain results in a payoff of 0, since neither farmer can individually finance the distribution and transportation costs of this new grain. Finally, coordinated transition results in an unknown payoff of θ, which I will assume for simplicity takes the value θ = 1 if crop yield is high, and θ = −1 if crop yield is low. The payoffs to this game are summarized below:

                 Adopt       Not Adopt
    Adopt        θ, θ        0, 1/2
    Not Adopt    1/2, 0      1/2, 1/2
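As a quick check on these payoffs, the adoption threshold can be computed directly. The sketch below is my own illustration (not from the text; the function name and the belief parameters p and q_adopt are invented): against an adopting opponent, Adopt pays E[θ] = 2p − 1 under a belief p on high yield, which strictly beats the safe payoff 1/2 exactly when p > 3/4.

```python
# Hypothetical illustration (not from the text): expected payoffs in the
# adoption game under a belief p on theta = 1 (high yield) and a belief
# q_adopt that the opponent adopts.

def expected_payoff(action, p, q_adopt):
    e_theta = 2 * p - 1                # E[theta] = p*1 + (1-p)*(-1)
    if action == "adopt":
        # Adopting pays theta if the opponent adopts, 0 if he does not.
        return q_adopt * e_theta
    return 0.5                         # Not Adopt pays 1/2 regardless

# Against a surely-adopting opponent, Adopt is strictly better iff p > 3/4.
assert expected_payoff("adopt", 0.9, 1.0) > expected_payoff("not", 0.9, 1.0)
assert expected_payoff("adopt", 0.7, 1.0) < expected_payoff("not", 0.7, 1.0)
```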

When should we predict coordination on adoption of the new grain?

In the standard approach, all uncertainty about the new grain is described by a state space Ω, and we assume that agents share a common prior over Ω. In the absence of any private information about the new grain, this approach implies that the two farmers have an identical belief over its future yield. But predicting yields of a new kind of crop is not easy: crop yield is a function of many environmental conditions—the soil structure, light exposure, quantity of rain, etc. I propose an alternative perspective for modeling their beliefs to capture the role that this complexity may play in determining disagreement between agents.


Learning from Data. In the proposed model, farmers predict the future yield of the new crop based on how it previously fared in other environments. There are r < ∞ relevant environmental conditions (soil structure, light, ...). For this example, let us assume that each condition takes a value in the interval [−c, c], and that the true relationship between environmental conditions in [−c, c]^r and crop yields (high or low) is given by the following deterministic function:

    p(x) = { High   if x ∈ [−c0, c0]^r
           { Low    otherwise                for all x ∈ [−c, c]^r,

where c0 ∈ (0, c). That is, crop yields are high under conditions in [−c0, c0]^r, and low otherwise. (See Figure 1.1 for an illustration of this relationship with r = 2.)

Figure 1.1: Two relevant attributes (r = 2). Yield is high under environmental conditions in [−c0, c0]², and low otherwise. Farmers do not know the high yield region (shaded).

It is common knowledge that there is a (hyper-)rectangular region of favorable environmental conditions (high yield), and a remaining region of unfavorable conditions. The farmers do not, however, know the exact regions. Instead, they observe the common data

    (x_1, p(x_1)), ..., (x_n, p(x_n)),

where the x_i are identically and independently drawn from a uniform distribution on [−c, c]^r. That is, farmers observe crop yields in n different sampled environments.

From this data, farmers infer a partitioning that correctly classifies every observation, and use this inferred relationship to predict whether the yield will be low or high in their own region. For simplicity, let us take this region to be the origin.

The key observation is that many rectangular partitionings perfectly fit the data, some of which make different predictions at the origin. This creates room for potential (rational) disagreement. (Figure 1.2 illustrates two such partitionings

based on an example dataset.) Say that a strategic prediction is robust if it holds

without further assumption regarding which partitioning either farmer infers, or

which partitioning he believes the other farmer to infer. When is coordinated

adoption robust?

Figure 1.2: Circles indicate low yields, and squares indicate high yields. The two rectangles identify partitionings (predict high yield if $x$ is contained within the rectangle, and low yield otherwise) that exactly fit the data. The prediction at the origin (marked "?") differs between them.


Robustness. Let us first clarify this criterion as follows. Every realization of

the data pins down a set of predictions, each of which is consistent with some

rectangular partitioning that exactly fits the data. Now, suppose only that this

set of predictions is common certainty—that is, both farmers put probability 1

on this set of predictions, believe that the other puts probability 1 on this set of

predictions, and so forth.8 This defines a set of possible (hierarchies of) beliefs that

either farmer could hold.

The key object of interest will be the probability that data is realized such that

Adopt is rationalizable given any belief in this set. This probability is a function of

the quantity of data n and of the number of conditions r; I will write it as p(n, r),

and refer to it as the plausibility of coordination on adoption.

Claim 1 For every quantity of data $n \ge 1$, number of environmental conditions $r \ge 1$, and constants $c, c_0 \in \mathbb{R}_+$,
$$p(n, r) = \left(1 - \left[\,2\left(\frac{2c - c_0}{2c}\right)^{n} - \left(\frac{c - c_0}{c}\right)^{n}\right]\right)^{r}.$$

Proof 1 See appendix.

This claim has several implications.

Observation 1 Coordinated adoption is more plausible when the quantity of data n is

larger.

8 Formally, define $\mathcal{P}$ to be the family of functions
$$\hat{p}(x) = \hat{p}(x^1, \ldots, x^r) = \begin{cases} 1 & \text{if } x^k \in [\underline{c}_k, \overline{c}_k] \text{ for every } k = 1, \ldots, r \\ 0 & \text{otherwise} \end{cases} \qquad \forall\, x \in \mathcal{X},$$
parametrized by the tuple $(\underline{c}_1, \overline{c}_1, \ldots, \underline{c}_r, \overline{c}_r) \in \mathbb{R}^{2r}$. This defines the class of all axis-aligned hyper-rectangles. Agents have common certainty in the set
$$\{\hat{p}(0) : \hat{p} \in \mathcal{P} \text{ and } \hat{p}(x_k) = p(x_k) \ \forall\, k = 1, \ldots, n\}.$$


From Claim 1, we see that $p(n, r)$ is increasing in $n$. Indeed, $p(n, r) \to 1$ as

the quantity of data n tends to infinity (for fixed r). Thus, if farmers observe crop

yields in sufficiently many different environments, then coordinated adoption is

arbitrarily plausible.

Observation 2 Coordinated adoption is less plausible when the number of environmental

conditions r is larger.

From Claim 1, we see that $p(n, r)$ is decreasing in $r$ when $n$ is sufficiently large.9 In fact, $p(n, r) \to 0$ as the number of environmental conditions $r$ tends to infinity (for fixed $n$). This suggests that coordinated adoption is more plausible when crop yield depends on a single environmental condition than when it depends on a high-dimensional set of covariates.
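The comparative statics in Observations 1 and 2 can be checked numerically. The sketch below (Python, with hypothetical values $c = 1$ and $c_0 = 0.5$) evaluates the closed form from Claim 1; as a sanity check on the $r = 1$ case, it also estimates the same probability by direct simulation, using the fact that every consistent interval partitioning predicts high yield at the origin exactly when the data contains a sample in each of $[-c_0, 0]$ and $[0, c_0]$.

```python
import random

def plausibility(n, r, c=1.0, c0=0.5):
    """Closed form from Claim 1: the probability that coordinated
    adoption is robust with n observations and r conditions."""
    q = 2 * ((2 * c - c0) / (2 * c)) ** n - ((c - c0) / c) ** n
    return (1 - q) ** r

# Observation 1: plausibility increases in the quantity of data n.
assert plausibility(10, 2) < plausibility(100, 2) < plausibility(1000, 2)

# Observation 2: for large n, plausibility decreases in r.
assert plausibility(50, 1) > plausibility(50, 5) > plausibility(50, 50)

# Monte Carlo check of the r = 1 case.
random.seed(0)
c, c0, n, trials = 1.0, 0.5, 5, 20000
hits = 0
for _ in range(trials):
    xs = [random.uniform(-c, c) for _ in range(n)]
    if any(-c0 <= x <= 0 for x in xs) and any(0 <= x <= c0 for x in xs):
        hits += 1
print(hits / trials, plausibility(n, 1))  # the two values should be close
```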

Observation 3 Coordinated adoption is less plausible when the set of inference rules is

larger.

The probability $p(n, r)$ weakly decreases, for every $n$ and $r$, as we expand the set of possible interpretations of the data. For example, suppose it is common knowledge that the region of high crop yields is described by a (possibly rotated) rectangle, instead of an axis-aligned rectangle as assumed above. This weakly expands the room for possible disagreement, since there are datasets, such as the one in the figure below, where every axis-aligned rectangle partitioning predicts high yield at the origin, but some rotated rectangle partitioning predicts low yield.10

9A sufficient condition is
$$2\left(\frac{2c - c_0}{2c}\right)^{n} - \left(\frac{c - c_0}{c}\right)^{n} < \frac{1}{r}.$$

10Since the set of rotated rectangle partitionings includes the set of axis-aligned rectangle partitionings, clearly if every partitioning in the former set predicts high yield at the origin, then every partitioning in the latter set will as well.


Figure 1.3: Every axis-aligned rectangle partitioning predicts high yield at the origin, but there exists a rotated rectangle partitioning that predicts low yield.

This suggests that coordinated adoption is more plausible when extrapolation from

past crop yields is coordinated by external means—for example, a common culture,

or a common set of heuristics.

Takeaways. Under the proposed approach, coordinated adoption of the new grain is less plausible when agents have previously observed few trial instances of the new crop, when the determinants of crop yield are high-dimensional, and when there is no common approach to extrapolating from past yields. In the main body of the paper, I generalize the ideas in this example,

proposing a model in which agents have common certainty in the predictions of

an arbitrary class of inference rules, and a robustness criterion for equilibria and

rationalizable actions in all finite normal-form games.

1.3 Preliminaries and Notation

1.3.1 The game

Consider a set $\mathcal{I}$ of $I < \infty$ agents and finite action sets $(A_i)_{i \in \mathcal{I}}$. As usual, let $A = \prod_{i \in \mathcal{I}} A_i$. The set of complete information (normal-form) games defined on


these primitives is the set of payoff matrices in $U := \mathbb{R}^{|\mathcal{I}| \times |A|}$. Let $\Theta \subseteq \mathbb{R}^k$ be a compact set of parameters and fix a metric $d^0$ on $\Theta$ such that $(\Theta, d^0)$ is complete and separable. I will identify these parameters with payoff matrices under a map $g$ satisfying:

Assumption 1 $g : \Theta \to U$ is a bounded and Lipschitz continuous embedding.11

This map $g$ can be interpreted as capturing the known information about the structure of payoffs. For example, in the game presented in Section 2, players know that payoffs belong to the parametric family
$$\begin{array}{c|cc} & a_1 & a_2 \\ \hline a_1 & \theta,\ \theta & 0,\ \tfrac{1}{2} \\ a_2 & \tfrac{1}{2},\ 0 & \tfrac{1}{2},\ \tfrac{1}{2} \end{array}$$

but do not know the value of $\theta$. Notice that identifying payoffs with parameters in this way is without loss of generality, since we can always take $\Theta := U$ and set $g$ to be the identity map. For clarity of exposition, I will sometimes write $u(a, \theta)$ for $g(\theta)(a)$, or $u_i(a, \theta)$ for the payoffs to agent $i$. Finally, denote the true value of the parameter by $\theta^*$, and suppose that it is unknown.

Remark 1 It is also possible to interpret $\theta$ as indexing a family of distributions over payoffs; for example, $\theta$ may be the mean of a normal distribution with a fixed variance. In this case, $g$ maps parameters in $\Theta$ to expected payoffs under the distribution determined by $\theta$.

1.3.2 Beliefs

Now let us define beliefs on $\Theta$.

11A map is an embedding if it is a homeomorphism onto its image.


Type space. For notational simplicity, consider first $I = 2$. Following Brandenburger and Dekel (1993), recursively define
$$X_0 = \Theta, \qquad X_1 = X_0 \times \Delta(X_0), \qquad \ldots, \qquad X_n = X_{n-1} \times \Delta(X_{n-1}),$$
and take $T_0 = \prod_{n=0}^{\infty} \Delta(X_n)$. An element $(t^1, t^2, \ldots) \in T_0$ is a complete description of beliefs over $\Theta$ (describing the agent's uncertainty over $\Theta$, his uncertainty over his opponents' uncertainty over $\Theta$, and so forth), and is referred to as a type.

This approach can be generalized for $I$ agents, taking $X_0 = \Theta$, $X_1 = X_0 \times (\Delta(X_0))^{I-1}$, and building up in this way. Mertens and Zamir (1985) have shown that for every agent $i$, there is a subset of types $T^*_i$ (that satisfy the property of coherency12) and a function $\kappa^*_i : T^*_i \to \Delta(\Theta \times T^*_{-i})$ such that $\kappa_i(t_i)$ preserves the beliefs in $t_i$; that is, $\mathrm{marg}_{X_{n-1}} \kappa_i(t_i) = t^n_i$ for every $n$. Notice that $T^*_{-i}$ is used here to denote the set of profiles of opponent types.

The tuple $(T^*_i, \kappa^*_i)_{i \in \mathcal{I}}$ is known as the universal type space. Other tuples $(T_i, \kappa_i)_{i \in \mathcal{I}}$, with $T_i \subseteq T^*_i$ for every $i$ and $\kappa_i : T_i \to \Delta(\Theta \times T_{-i})$, represent alternative (smaller) type spaces. Finally, let $T^* = T^*_1 \times \cdots \times T^*_I$ denote the set of all type profiles, with typical element $t = (t_1, \ldots, t_I)$.

Remark 2 Types are sometimes modeled as encompassing all uncertainty in the game. In this paper, I separate strategic uncertainty over opponent actions from structural uncertainty over $\Theta$.

Topology on types. Let $T^k_i = \Delta(X_{k-1}) = \Delta(\Theta \times T^{k-1}_{-i})$ denote the set of possible

12$\mathrm{marg}_{X_{n-2}} t^n = t^{n-1}$, so that $(t^1, t^2, \ldots)$ is a consistent stochastic process.


$k$-th order beliefs for agent $i$.13 The uniform-weak topology on $T^*_i$, proposed in Chen et al. (2010), is the metric topology generated by the distance
$$d^{UW}_i(t_i, t'_i) = \sup_{k \ge 1} d^k(t_i, t'_i) \qquad \forall\, t_i, t'_i \in T^*_i,$$
where $d^0$ is the metric defined on $\Theta$ (see Section 2.1)14 and, recursively for $k \ge 1$, $d^k$ is the Prokhorov distance15 on $\Delta(\Theta \times T^{k-1}_{-i})$ induced by the metric $\max\{d^0, d^{k-1}\}$ on $\Theta \times T^{k-1}_{-i}$.

Common $p$-belief. Define $\Omega = \Theta \times T^*$ to be the set of all "states of the world," such that every element in $\Omega$ corresponds to a complete description of uncertainty. Following Monderer and Samet (1989), for every $E \subseteq \Omega$, let
$$B^p(E) := \{(\theta, t) : \kappa_i(t_i)(E) \ge p \text{ for every } i\} \tag{1.1}$$
describe the event in which every agent believes $E \subseteq \Omega$ with probability at least $p$. Common $p$-belief in the set $E$ is given by
$$C^p(E) := \bigcap_{k \ge 1} [B^p]^k(E).$$
The special case of common 1-belief is referred to in this paper as common certainty. I use in particular the concept of common certainty in a set of first-order beliefs.
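For intuition, $B^p$ and $C^p$ can be computed directly in a small finite example. The sketch below is illustrative only and uses a partition model with a common prior (a special case of the formalism above, with hypothetical partitions): it applies the operator $B^p$ iteratively and intersects the iterates.

```python
def p_belief(event, partitions, prior, p):
    """B^p(E): the states at which every agent assigns probability >= p
    to the event E, conditional on his partition cell."""
    believed = set()
    for w in prior:
        ok = True
        for part in partitions:
            cell = next(c for c in part if w in c)
            mass = sum(prior[v] for v in cell)
            cond = sum(prior[v] for v in cell if v in event) / mass
            if cond < p:
                ok = False
                break
        if ok:
            believed.add(w)
    return believed

def common_p_belief(event, partitions, prior, p, rounds=50):
    """C^p(E): the intersection of the iterates [B^p]^k(E)."""
    e, result = set(event), None
    for _ in range(rounds):
        e = p_belief(e, partitions, prior, p)
        result = e if result is None else result & e
        if not e:
            break
    return result

prior = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}       # uniform common prior
partitions = [
    [{0, 1}, {2, 3}],     # agent 1's information partition
    [{0}, {1, 2}, {3}],   # agent 2's information partition
]
E = {0, 1, 2}
print(common_p_belief(E, partitions, prior, p=0.75))  # empty: E is common 0.75-belief nowhere
```

Since $B^p$ is monotone, the iterates are decreasing and the intersection is their limit, so the loop terminates at a fixed point (or the empty set).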

13Working only with types in the universal type space, it is possible to identify each $X_k$ with its first and last coordinates, since all intermediate information is redundant.

14In Chen et al. (2010), $\Theta$ is finite and $d^0$ is the discrete metric, but this construction extends to all complete and separable $(\Theta, d^0)$.

15Recall that the Levy-Prokhorov distance $\rho$ between measures on a metric space $(X, d)$ is defined as
$$\rho(\mu, \mu') = \inf\big\{\delta > 0 : \mu(E) \le \mu'(E^\delta) + \delta \text{ for each measurable } E \subseteq X\big\}$$
for all $\mu, \mu' \in \Delta(X)$, where $E^\delta = \{x \in X : \inf_{x' \in E} d(x, x') < \delta\}$.


For any $F \subseteq \Delta(\Theta)$, define the event
$$E_F := \{(\theta, t) : \mathrm{marg}_\Theta\, t_i \in F \text{ for every } i\}, \tag{1.2}$$
in which every agent's first-order belief is in $F$. Then, $C^1(E_F)$ is the event in which it is common certainty that every agent has a first-order belief in $F$. The set of types $t_i$ given which agent $i$ believes that $F$ is common certainty is the projection of $C^1(E_F)$ onto $T^*_i$.16 Since this set is identical across agents, I will refer to this simply as the set of types with common certainty in $F$.

1.3.3 Solution concepts

Two solution concepts for incomplete information games are used in this paper.

Interim Correlated Rationalizability (Dekel et al., 2007). For every agent $i$ and type $t_i$, set $S^0_i[t_i] = A_i$, and define $S^k_i[t_i]$ for $k \ge 1$ such that $a_i \in S^k_i[t_i]$ if and only if $a_i \in BR_i\big(\mathrm{marg}_{\Theta \times A_{-i}}\, \pi\big)$ for some $\pi \in \Delta(\Theta \times T_{-i} \times A_{-i})$ satisfying (1) $\mathrm{marg}_{\Theta \times T_{-i}}\, \pi = \kappa_i(t_i)$ and (2) $\pi\big(a_{-i} \in S^{k-1}_{-i}[t_{-i}]\big) = 1$, where $S^{k-1}_{-i}[t_{-i}] = \prod_{j \ne i} S^{k-1}_j[t_j]$. We can interpret $\pi$ as an extension of the belief $\kappa_i(t_i)$ onto the space $\Delta(\Theta \times T_{-i} \times A_{-i})$, with support in the set of actions that survive $k - 1$ rounds of iterated elimination of strictly dominated strategies for types in $T_{-i}$. For every $i$, define
$$S^\infty_i[t_i] = \bigcap_{k=0}^{\infty} S^k_i[t_i]$$
to be the set of actions that are interim correlated rationalizable for agent $i$ of type $t_i$, or (henceforth) simply rationalizable.
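The flavor of this iterated-elimination procedure can be seen in the complete-information case. The sketch below is a deliberate simplification: it eliminates only actions strictly dominated by another pure action in a two-player game, so it computes a superset of the rationalizable actions (full interim correlated rationalizability also allows domination by mixed actions and correlated beliefs).

```python
def iterated_elimination(u1, u2):
    """Iteratively remove actions strictly dominated by another pure
    action; u1[i][j] and u2[i][j] are the payoffs to players 1 and 2
    at the action profile (i, j)."""
    rows, cols = set(range(len(u1))), set(range(len(u1[0])))
    changed = True
    while changed:
        changed = False
        for i in list(rows):
            if any(all(u1[k][j] > u1[i][j] for j in cols)
                   for k in rows if k != i):
                rows.discard(i)
                changed = True
        for j in list(cols):
            if any(all(u2[i][k] > u2[i][j] for i in rows)
                   for k in cols if k != j):
                cols.discard(j)
                changed = True
    return rows, cols

# A prisoner's-dilemma-style game: the second action strictly dominates,
# so elimination pins down a unique surviving profile.
u1 = [[3, 0], [5, 1]]
u2 = [[3, 5], [0, 1]]
print(iterated_elimination(u1, u2))  # ({1}, {1})
```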

Interim Bayesian Nash equilibrium. Fix any type space $(T_i, \kappa_i)_{i \in \mathcal{I}}$. A strategy for

16Notice that when beliefs are allowed to be wrong (as they are in this approach), individual perception of common certainty is the relevant object of study. That is, agent $i$ can believe that a set of first-order beliefs is common certainty, even if no other agent in fact has a first-order belief in this set. Conversely, even if every agent indeed has a first-order belief in $F$, agent $i$ may believe that no other agent has a first-order belief in this set.


player $i$ is a measurable function $s_i : T_i \to A_i$. The strategy profile $(s_1, \ldots, s_I)$ is a Bayesian Nash equilibrium if
$$s_i(t_i) \in \arg\max_{a_i \in A_i} \int_{\Theta \times T_{-i}} u_i(a_i, s_{-i}(t_{-i}), \theta) \, d\kappa_i(t_i) \qquad \text{for every } i \in \mathcal{I} \text{ and } t_i \in T_i.$$
In a slight abuse of terminology, I will say throughout that action profile $a$ is an (interim) Bayesian Nash equilibrium if the strategy $s$ with $s_i(t_i) = a_i$ for every $t_i \in T_i$ is a Bayesian Nash equilibrium.
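As a preview of how constant strategy profiles are verified later in the paper: when every agent's first-order belief is supported on finitely many payoff matrices, a sufficient condition for the constant profile $s_i(t_i) = a^*_i$ to be a Bayesian Nash equilibrium is that $a^*$ is a Nash equilibrium in each of those matrices. A minimal sketch, using the parametric game of Section 1.3.1 with hypothetical point beliefs on $\theta \in \{0.8, 1.0, 1.2\}$:

```python
def is_nash(u1, u2, profile):
    """Check that profile = (i*, j*) is a Nash equilibrium of the
    two-player game with payoff matrices (u1, u2)."""
    i_s, j_s = profile
    row_ok = all(u1[i_s][j_s] >= u1[i][j_s] for i in range(len(u1)))
    col_ok = all(u2[i_s][j_s] >= u2[i_s][j] for j in range(len(u1[0])))
    return row_ok and col_ok

def game(theta):
    """Payoff matrices of the parametric game: (a1, a1) pays (theta, theta)."""
    u1 = [[theta, 0.0], [0.5, 0.5]]
    u2 = [[theta, 0.5], [0.0, 0.5]]
    return u1, u2

# Three plausible first-order (point) beliefs about theta.
plausible = [game(t) for t in (0.8, 1.0, 1.2)]

# (a1, a1) is a Nash equilibrium in every plausible payoff matrix, so the
# constant profile playing a1 is a Bayesian Nash equilibrium for all types
# with common certainty in these beliefs.
print(all(is_nash(u1, u2, (0, 0)) for u1, u2 in plausible))  # True
```

The pointwise check is sufficient because the expected-payoff inequality under any belief over these matrices follows by integrating the matrix-by-matrix inequalities.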

1.4 Learning from Data

What are agent beliefs over the unknown parameter (and over opponent beliefs

over the unknown parameter), and how are they formed? In this section, I describe

a framework in which beliefs are formed by learning from data.

Say that a dataset is any sequence of $n$ observations $z_1, \ldots, z_n$, sampled i.i.d. from a set $\mathcal{Z}$ according to an exogenous distribution $P$. Throughout, I use $Z^n$ to denote the random sequence corresponding to $n$ observations, and $z^n$ to denote a typical realization (or simply $Z$ and $z$ when the number of observations is not important).

The key assumption of my approach is a restriction on the possible types that agents can have following rationalization of the data. I begin by introducing a few relevant concepts. Define an inference rule to be any map $\mu : z \mapsto \mu_z$ from the set of possible datasets17 to $\Delta(\Theta)$, the set of (Borel) probability measures on $\Theta$. Fix a family of inference rules $M$.

Definition 1 For every dataset $z$, say that
$$F_z = \{\mu_z : \mu \in M\} \subseteq \Delta(\Theta)$$

17$\bigcup_{n \ge 1} \mathcal{Z}^n$, where $\mathcal{Z}^n$ denotes the $n$-fold Cartesian product of the set $\mathcal{Z}$.


is the set of plausible first-order beliefs.

This is the set of all distributions over $\Theta$ that emerge from evaluating the dataset

z using an inference rule in M. For every dataset z, define Tz to be the set of all

(interim) types for whom Fz is common certainty.18 That is, every type in Tz has

a plausible first-order belief, puts probability 1 on every other agent having a

plausible first-order belief, and so forth. The main restriction below, which I will

refer to from now on as common inference, assumes that following realization of

data z, every agent has a type in Tz.19

Assumption 2 (Common inference) Given any dataset $z$, every agent $i$ has an (interim) type $t_i \in T_z$.20

Several special examples for the set of inference rules M are collected below.

Example 1 (Bayesian updating with a common prior) Define $\mu$ to be the map that takes any dataset $z$ into the Bayesian posterior induced from the common prior and a common likelihood function. Let $M = \{\mu\}$. Then, for every dataset $z$, the set $F_z$ consists of the singleton Bayesian posterior induced from the common prior, and the set $T_z$ consists of the singleton type with common certainty in this Bayesian posterior.21

Example 2 (Bayesian updating with uncommon priors) Fix a set of prior distributions on $\Theta$ and a common likelihood function. Every inference rule $\mu \in M$ is identified

18See the end of Section 3.2 for a more formal exposition.

19The results in this paper follow without modification if we relax this assumption to common certainty in the convex hull of distributions in $F_z$. See Lemma 5.

20Notice that this paper takes an unusual interpretation of the ex-ante/interim distinction, which does not explicitly invoke a Bayesian perspective. In this paper, the role of the prior is replaced by a data-generating process.

21That is, his first-order beliefs are given by this posterior, and he believes with probability 1 that his opponents' first-order beliefs are given by this posterior, and so forth.


with a prior distribution in this set, and maps the observed data to the Bayesian posterior

induced from this prior and the common likelihood.

Example 3 (Kernel regression with different bandwidth sequences) Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a set of attributes, which are related to outcomes in $\Theta$ under the unknown map $f : \mathcal{X} \to \Theta$. Data is of the form
$$(x_1, y_1), \ldots, (x_n, y_n),$$
where every $x_k \in \mathcal{X}$ and every $y_k = f(x_k)$. Suppose that the unknown parameter $\theta^*$ is the value of the function $f$ evaluated at the input $x_0$.

Inference rules in $M$ map the data to an estimate for $\theta^*$ by first producing an estimated function $\hat{f}$, and then evaluating this function at $x_0$. The approach for estimating $f$ is as follows: fix a kernel function22 $K : \mathbb{R}^d \to \mathbb{R}$, and let $h_n \to 0$ be a sequence of constants tending to zero. Define $\hat{f}_{n,h} : \mathcal{X} \to \Theta$ to be the Nadaraya-Watson estimator
$$\hat{f}_{n,h}(x) = \frac{(nh_n)^{-1} \sum_{k=1}^{n} y_k\, K\big((x - x_k)/h_n^{1/d}\big)}{(nh_n)^{-1} \sum_{k=1}^{n} K\big((x - x_k)/h_n^{1/d}\big)},$$
which produces estimates by taking a weighted average of nearby observations. This describes an individual inference rule $\mu$.

Now let $H$ be a set of (bandwidth) sequences, each of which determines a different level of "smoothing" applied to the data. Every inference rule $\mu \in M$ is identified with a sequence $h_n \in H$. Thus, $M$ is a set of kernel regression estimators with different bandwidth sequences.
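A one-dimensional sketch of this example (hypothetical data from a step function, echoing the crop-yield example, with a Gaussian kernel): different bandwidths in $H$ produce different estimates of $\theta^* = f(x_0)$, so with little data the set of plausible first-order beliefs is nondegenerate.

```python
import math

def nw_estimate(x0, data, h):
    """Nadaraya-Watson estimate of f(x0) with a Gaussian kernel and
    bandwidth h; data is a list of (x, y) pairs."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x, _ in data]
    return sum(w * y for w, (_, y) in zip(weights, data)) / sum(weights)

# Observations from a step function: yield is 1 on [-0.5, 0.5], else 0.
f = lambda x: 1.0 if abs(x) <= 0.5 else 0.0
xs = [-0.9, -0.6, -0.3, 0.1, 0.4, 0.7, 0.95]
data = [(x, f(x)) for x in xs]

bandwidths = [0.05, 0.2, 1.0]   # the set H of smoothing levels
estimates = [nw_estimate(0.0, data, h) for h in bandwidths]
print([round(e, 3) for e in estimates])  # estimates disagree across bandwidths
```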

Remark 3 Common inference does not impose an explicit relationship between agent beliefs

22$K$ is measurable and satisfies the conditions
$$\int_{\mathbb{R}^d} K(x)\, dx = 1, \qquad \sup_{x \in \mathbb{R}^d} \|K(x)\| = k < \infty.$$


and estimators. For example, all of the following are consistent with common inference:

• Every agent $i$ is identified with an inference rule $\mu_i \in M$, and the sequence of inference rules $(\mu_i)_{i \in \mathcal{I}}$ is common knowledge.

• Every agent $i$ is identified with an inference rule $\mu_i \in M$. Agent $i$ knows his own inference rule $\mu_i$, but has a nondegenerate belief distribution over the inference rules of other agents.

• Every agent $i$ is identified with a distribution $P_i$ on $M$, and draws an inference rule at random from $M$ under this distribution.

In the main part of this paper, I assume common inference (Assumption 2),

and ask what properties of beliefs and strategic behavior can be deduced from this

assumption alone.

1.4.1 When do agents commonly learn?

Let us begin by considering the property of common learning. Say that agents

commonly learn the true parameter $\theta^*$ if, as the quantity of data increases, every agent believes that it is approximate common certainty that the value of the parameter is close to $\theta^*$. It will be useful in this section to assign to every agent $i$ a map $t_i : z \mapsto t_i^z$, which takes the realized data into a type in $T_z$.

Definition 2 (Common Learning) Agents commonly learn $\theta^*$ if
$$\lim_{n \to \infty} P^n\big(\big\{z^n : t_i^{z^n} \in C^p(B_\epsilon(\theta^*))\big\}\big) = 1 \qquad \forall\, i,$$
for every $p \in [0, 1)$ and $\epsilon > 0$.

That is, for every level of confidence $p$ and precision $\epsilon$, every agent eventually believes that the $\epsilon$-ball around the true parameter $\theta^*$ is common $p$-belief. This


definition is modified from Cripps et al. (2008).23 When does common inference imply that agents commonly learn the true parameter $\theta^*$?

The following property of families of inference rules M will be useful:

Definition 3 (Uniform consistency.) The family of inference rules $M$ is $\theta^*$-uniformly consistent if
$$\sup_{\mu \in M} d_P(\mu_{Z^n}, \delta_{\theta^*}) \to 0 \quad \text{a.s.},$$
where $d_P$ is the Prokhorov metric on $\Delta(\Theta)$.

Remark 4 Say that an individual inference rule $\mu$ is $\theta^*$-consistent if $d_P(\mu_{Z^n}, \delta_{\theta^*}) \to 0$ a.s. Uniform consistency is immediately satisfied by any finite family of $\theta^*$-consistent inference rules.

Recalling that $d_P$ metrizes the topology of weak convergence of measures, this property says that for every $\mu \in M$, the distribution $\mu_{Z^n}$ (almost surely) converges weakly to a degenerate distribution on $\theta^*$. Moreover, this convergence is uniform in inference rules.

Proposition 1 Every agent commonly learns the true parameter $\theta^*$ if and only if $M$ is $\theta^*$-uniformly consistent.

The structure of the argument is as follows. From Chen et al. (2010), we

know that convergence in the uniform-weak topology is equivalent to approximate

common certainty in the true parameter. Under $\theta^*$-uniform consistency, it can be shown that, with probability 1, every sequence of types from $\{T_{Z^n}\}_{n \ge 1}$ converges in the uniform-weak topology to the type with common certainty in $\theta^*$. This is,

23I take $\epsilon > 0$, so that agents believe it is approximate common certainty that the parameter is close to $\theta^*$; in Cripps et al. (2008), $\Theta$ is finite, so agents believe it is approximate common certainty that the parameter is exactly $\theta^*$.


loosely, because possible $k$-th order beliefs are restricted to have support in the possible $(k-1)$-th order beliefs, so that in fact the rate of convergence of first-order beliefs is a uniform upper bound on the rate of convergence of beliefs at every

order. The details of this proof can be found in the appendix.

I assume in the remainder of the paper that $M$ is $\theta^*$-uniformly consistent, so that beliefs converge as the quantity of data tends to infinity. The next part of the paper shifts focus to the stronger property of convergence of solution sets.

1.5 Robustness to Inference

Suppose that action $a_i$ is rationalizable for agent $i$ (or, action profile $a$ is an equilibrium) in the complete information game in which the true parameter $\theta^*$ is common certainty. Can we guarantee that action $a_i$ remains rationalizable (action profile $a$ remains an equilibrium) when payoffs are inferred from data, so long as agents observe a sufficiently large number of observations?

1.5.1 Concepts

I will first introduce the idea of an inference game. For any dataset $z$, define $G(z)$ to be the incomplete information game with primitives $\mathcal{I}$, $(A_i)_{i \in \mathcal{I}}$, $\Theta$, $g$, and type space
$$\mathcal{T}_z = (T^z_i, \kappa^z_i)_{i \in \mathcal{I}},$$
where $T^z_i = T_z$ for every $i$, and $\kappa^z_i$ is the restriction of $\kappa_i$ (as defined in Section 3.2) to $T^z_i$.24 Notice that if $T_z$ consists only of the type with common certainty of $\theta^*$, then this game reduces to the complete information game with payoffs given by $g(\theta^*)$.

24Notice that $\kappa^z_i(T^z_i) \subseteq \Delta(\Theta \times T^z_{-i})$ for every agent $i$, so this is a belief-closed type space.


We can interpret inference games as follows. Suppose the analyst knows only that agents have observed data $z$, and that Assumption 2 (Common Inference) holds. Then, the set of types that any player $i$ may have is given by $T_z$. Recall that as the quantity of data tends to infinity, this set $T_z$ converges almost surely to the singleton type with common certainty in $\theta^*$. So, for large quantities of data, inference games approximate the (true) complete information game. The question of interest is with what probability solutions of this limit game persist in finite-data inference games. This question is made precise for the solution concepts of Nash equilibrium and rationalizability in the following way.

For any Nash equilibrium $a$ of the limit complete information game, define $p^{NE}_n(a)$ to be the probability that data $z^n$ is realized such that the strategy profile $(s_i)_{i \in \mathcal{I}}$, with
$$s_i(t_i) = a_i \qquad \forall\, i \in \mathcal{I},\ t_i \in T^{z^n}_i,$$
is a Bayesian Nash equilibrium. Analogously, define $p^R_n(i, a_i)$ to be the probability that data $z^n$ is realized such that
$$a_i \in S^\infty_i[t_i] \qquad \forall\, t_i \in T^{z^n}_i;$$
that is, $a_i$ is rationalizable for agent $i$ given any type in $T^{z^n}_i$.

Definition 4 The rationalizability of action $a_i$ for player $i$ is robust to inference if
$$p^R_n(i, a_i) \to 1 \quad \text{as } n \to \infty.$$
The equilibrium property of action profile $a$ is robust to inference if
$$p^{NE}_n(a) \to 1 \quad \text{as } n \to \infty.$$

What is the significance of robustness to inference? Suppose that action $a_i$ is rationalizable when the true parameter is common certainty, and suppose moreover that this property of $a_i$ is robust to inference. Then, the analyst believes with high probability that $a_i$ is rationalizable for all types in the realized inference game, so long as the quantity of observed data is sufficiently large. Conversely, suppose that $a_i$ is rationalizable when the true parameter is common certainty, but that this property of $a_i$ is not robust to inference. Then, there exists a constant $\delta > 0$ such that for any finite quantity of data, the probability that $a_i$ fails to be rationalizable for some type in the realized inference game is at least $\delta$. In this way, robustness to inference is a minimal requirement for the rationalizability of $a_i$ to persist when agents infer their payoffs from data. Analogous statements apply when we replace rationalizability with equilibrium.

Let us first consider two trivial examples in which robustness to inference imposes no restrictions. Consider the game with payoff matrix
$$\begin{array}{c|cc} & a_1 & a_2 \\ \hline a_1 & \theta^*,\ \theta^* & 0,\ 0 \\ a_2 & 0,\ 0 & \tfrac{1}{2},\ \tfrac{1}{2} \end{array}$$
where $\theta^* > 0$. Is the equilibrium $(a_1, a_1)$ robust to inference?

Example 4 (Trivial inference.) Let $M$ consist of the singleton inference rule $\mu$ satisfying
$$\mu_z = \delta_{\theta^*} \qquad \forall\, z,$$
so that $\mu_z$ is always degenerate on the true value $\theta^*$. Then, the set of plausible first-order beliefs is $F_z = \{\delta_{\theta^*}\}$ for every $z$, so that the true parameter $\theta^*$ is common certainty with probability 1. Thus, the inference game $G(z)$ reduces to a complete information game, and the equilibrium property of $(a_1, a_1)$ is trivially robust to inference.

Example 5 (Unnecessary inference.) Let $\Theta := [0, \infty)$. Then, action profile $(a_1, a_1)$ is a Nash equilibrium for every possible value of $\theta \in \Theta$. Thus, the strategy profile that maps


any type of either player to the action $a_1$ is a Bayesian Nash equilibrium for any beliefs that players might hold over $\Theta$. In this way, the family of inference rules $M$ is irrelevant, and $(a_1, a_1)$ is again trivially robust to inference.

The following two conditions rule out these cases in which inference is either trivial

or unnecessary.

Assumption 3 (Nontrivial Inference.) There exists a constant $\gamma > 0$ such that
$$P^n(\{z^n : \delta_{\theta^*} \in \mathrm{Int}(F_{z^n})\}) > \gamma$$
for every $n$ sufficiently large.

This property says that for sufficient quantities of data, the probability that $\delta_{\theta^*}$ is contained in the interior of the set of plausible first-order beliefs $F_{z^n}$ is bounded away from 0. Assumption 3 rules out the example of trivial inference, as well as related examples in which every inference rule in $M$ overestimates, or every inference rule underestimates, the unknown parameter.25

To rule out the second example, I impose a richness condition on the image of $g$. For every agent $i$ and action $a_i \in A_i$, define $S(i, a_i)$ to be the set of complete information games in which $a_i$ is a strictly dominant strategy for agent $i$; that is,
$$S(i, a_i) := \big\{u' \in U : u'_i(a_i, a_{-i}) > u'_i(a'_i, a_{-i}) \ \forall\, a'_i \ne a_i \text{ and } \forall\, a_{-i}\big\}.$$

Assumption 4 (Richness.) For every $i \in \mathcal{I}$ and $a_i \in A_i$, $g(\Theta) \cap S(i, a_i) \ne \emptyset$.

Under this restriction, which is also assumed in Carlsson and van Damme

(1993) and Weinstein and Yildiz (2007), every action is strictly dominant at some

parameter value. This condition is trivially satisfied if $\Theta = U$.

25This does not rule out sets of biased estimators. It may be that in expectation, every inference rule in $M$ overestimates the true parameter; Assumption 3 requires that underestimation occurs with probability bounded away from 0.


In the subsequent analysis, I assume that the family of inference rules M

satisfies nontrivial inference, and the map g satisfies richness. These conditions are

abbreviated to NI and R, respectively.

1.5.2 Bayesian Nash Equilibrium

When is the equilibrium property of an action profile robust to inference? (From

now on, I will abbreviate this to saying that the action profile is itself robust to

inference.)

Theorem 1 Assume NI and R. Then, the equilibrium property of action profile $a^*$ is robust to inference if and only if it is a strict Nash equilibrium.

The intuition for the proof is as follows. Define $U^{NE}_{a^*}$ to be the set of all payoffs $u$ such that $a^*$ is a Nash equilibrium in the complete information game with payoffs $u$. The interior of $U^{NE}_{a^*}$ is exactly the set of payoffs $u$ with the property that $a^*$ is a strict Nash equilibrium given these payoffs. I show that, as the quantity of data tends to infinity, agents (almost surely) have common certainty in a shrinking neighborhood of the true payoffs, so it follows that $a^*$ is robust to inference if and only if the true payoff function $u^* = g(\theta^*)$ lies in the interior of $U^{NE}_{a^*}$.

Proof 2 First, I show that the interior of the set $U^{NE}_{a^*}$ is characterized by the set of complete information games in which $a^*$ is a strict Nash equilibrium.

Lemma 1 $u \in \mathrm{Int}(U^{NE}_{a^*})$ if and only if action profile $a^*$ is a strict Nash equilibrium in the complete information game with payoffs $u$.

Proof 3 Suppose $a^*$ is not a strict Nash equilibrium in the complete information game with payoffs $u$. Then, there is some agent $i$ and action $a_i \ne a^*_i$ such that
$$u_i(a_i, a^*_{-i}) \ge u_i(a^*_i, a^*_{-i}).$$


Define $u^\epsilon$ such that $u^\epsilon_i(a_i, a^*_{-i}) = u_i(a_i, a^*_{-i}) + \epsilon$, and otherwise $u^\epsilon_i$ agrees with $u_i$. Then, $u^\epsilon \in B_\epsilon(u)$ for every $\epsilon > 0$, but $a_i$ is a strictly profitable deviation for agent $i$ in response to $a^*_{-i}$ in the game with payoffs $u^\epsilon_i$. So $a^*$ is not an equilibrium in this game. Fix any sequence of positive constants $\epsilon_n \to 0$. Then, $u^{\epsilon_n} \to u$ as $n \to \infty$, but $u^{\epsilon_n} \notin U^{NE}_{a^*}$ for every $n$, so it follows that $u \notin \mathrm{Int}(U^{NE}_{a^*})$, as desired.

Now suppose that $a^*$ is a strict Nash equilibrium in the complete information game with payoffs $u$. Then,
$$\epsilon^* := \inf_{i \in \mathcal{I}} \Big( u_i(a^*_i, a^*_{-i}) - \max_{a_i \ne a^*_i} u_i(a_i, a^*_{-i}) \Big) > 0,$$
so $u \in B_{\epsilon^*}(u) \subseteq U^{NE}_{a^*}$, with $B_{\epsilon^*}(u)$ nonempty and open. It follows that $u \in \mathrm{Int}(U^{NE}_{a^*})$, as desired.
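The margin $\epsilon^*$ in the argument above is directly computable. A minimal sketch for two-player games, using the example game of Section 1.5.1 with the hypothetical value $\theta^* = 1$:

```python
def strict_nash_margin(u1, u2, profile):
    """epsilon*: the smallest gap between the equilibrium payoff and the
    best unilateral deviation; positive iff the profile is a strict Nash
    equilibrium of the two-player game (u1, u2)."""
    i_s, j_s = profile
    gap1 = u1[i_s][j_s] - max(u1[i][j_s] for i in range(len(u1)) if i != i_s)
    gap2 = u2[i_s][j_s] - max(u2[i_s][j] for j in range(len(u1[0])) if j != j_s)
    return min(gap1, gap2)

# The coordination game of Section 1.5.1 with theta* = 1:
u1 = [[1.0, 0.0], [0.0, 0.5]]
u2 = [[1.0, 0.0], [0.0, 0.5]]
print(strict_nash_margin(u1, u2, (0, 0)))  # 1.0: (a1, a1) is strict
print(strict_nash_margin(u1, u2, (1, 1)))  # 0.5: (a2, a2) is strict
```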

Next, I show that $a^*$ is robust to inference if and only if the true payoff function is in the interior of the set $U^{NE}_{a^*}$.

Lemma 2 Let $u^* = g(\theta^*)$. The equilibrium property of action profile $a^*$ is robust to inference if and only if $u^* \in \mathrm{Int}(U^{NE}_{a^*})$.

Proof 4 Define $h(\mu) = \int_\Theta g(\theta)\, d\mu$ to be the map from (first-order) beliefs $\mu \in \Delta(\Theta)$ into the expected payoff function under $\mu$.

Figure 1.4: The map $h$ takes first-order beliefs $\mu \in \Delta(\Theta)$ into expected payoff functions in $U$; the set $F_z$ is mapped to $h(F_z)$.

Recall that every dataset z induces a set of plausible first-order beliefs Fz. The following


claim says that the equilibrium property of $a^*$ is robust to inference if and only if, with high probability, the set of expected payoffs $h(F_{Z^n})$ is contained within $U^{NE}_{a^*}$ as $n \to \infty$.

Claim 2 The equilibrium property of $a^*$ is robust to inference if and only if
$$P^n\big(\big\{z^n : h(F_{z^n}) \subseteq U^{NE}_{a^*}\big\}\big) \to 1 \quad \text{as } n \to \infty.$$

Proof 5 I will show that the strategy profile $(s_i)_{i \in \mathcal{I}}$ with
$$s_i(t_i) = a^*_i \qquad \forall\, i \in \mathcal{I},\ \forall\, t_i \in T_z$$
is a Bayesian Nash equilibrium if and only if $h(F_z) \subseteq U^{NE}_{a^*}$. From this, the above claim follows immediately.

Suppose $h(F_z) \subseteq U^{NE}_{a^*}$. Then, for any payoff function $u \in h(F_z)$,
$$u_i(a^*_i, a^*_{-i}) \ge u_i(a_i, a^*_{-i}) \qquad \forall\, i \in \mathcal{I} \text{ and } a_i \ne a^*_i. \tag{1.3}$$
Fix an arbitrary agent $i$ and type $t_i$ with common certainty in $h(F_z)$, and define $\mu_i := \mathrm{marg}_\Theta\, t_i$ to be his first-order belief. By construction, $\mu_i$ assigns probability 1 to $h(F_z)$, so it follows from (1.3) that
$$\int_U u_i(a^*_i, a^*_{-i}) \, dg_*(\mu_i) \ge \int_U u_i(a_i, a^*_{-i}) \, dg_*(\mu_i) \qquad \forall\, a_i \ne a^*_i,$$
where $g_*(\mu)$ denotes the pushforward measure of $\mu$ under the mapping $g$. Repeating this argument for all agents and all types with common certainty in $h(F_z)$, it follows that $(s_i)_{i \in \mathcal{I}}$ is indeed a Bayesian Nash equilibrium.

Now suppose to the contrary that $h(F_z) \not\subseteq U^{NE}_{a^*}$, and consider any payoff function $u$ that is in $h(F_z)$ but not in $U^{NE}_{a^*}$. Then, there exists some agent $i$ for whom
$$u_i(a^*_i, a^*_{-i}) - \max_{a_i \ne a^*_i} u_i(a_i, a^*_{-i}) < 0.$$
Let $t_i$ be the type with common certainty in $g^{-1}(u)$. Then, agent $i$ of type $t_i$ has a profitable

29

Page 42: Economic Theory and Statistical Learning

deviation to some ai 6= a⇤i , so (si)i2I is not a Bayesian Nash equilibrium.

The final claim says that
$$\mathbb{P}^n\left(\left\{ z^n : h(F_{z^n}) \subseteq U^{NE}_{a^*} \right\}\right) \to 1 \quad \text{as } n \to \infty$$
if and only if $u^*$ is in the interior of the set $U^{NE}_{a^*}$. This is, loosely, because $h(F_{z^n})$ converges to the singleton set $\{u^*\}$; its proof is deferred to the appendix.

Claim 3 $\lim_{n \to \infty} \mathbb{P}^n\left(\left\{ z^n : h(F_{z^n}) \subseteq U^{NE}_{a^*} \right\}\right) = 1$ if and only if $u^* \in \mathrm{Int}(U^{NE}_{a^*})$.

The theorem directly follows from Lemmas 1 and 2.

1.5.3 Rationalizable Actions

When is the property of rationalizability of an action robust to inference? Theorem

1 suggests that the corresponding condition is strict rationalizability in the limit

complete information game. This intuition is roughly correct, but subtleties in the

procedure of elimination are relevant, and the theorem below will rely on two

different such procedures.

First, recall the usual definition of strict rationalizability, introduced in Dekel et al. (2006). For every agent $i$ and type $t_i$, set $R^1_i(t_i) = A_i$. Then, recursively define $R^k_i(t_i)$, for every $k \geq 2$, such that $a_i \in R^k_i[t_i]$ if and only if
$$\int_{\Theta \times T_{-i} \times A_{-i}} \left[ u_i(a_i, a_{-i}, \theta) - u_i(a'_i, a_{-i}, \theta) \right] d\pi > 0 \quad \forall\, a'_i \neq a_i \tag{1.4}$$
for some distribution $\pi \in \Delta(\Theta \times T_{-i} \times A_{-i})$ satisfying (1) $\mathrm{marg}_{\Theta \times T_{-i}}\, \pi = \kappa_{t_i}$, and (2) $\pi\big(a_{-i} \in R^{k-1}_{-i}[t_{-i}]\big) = 1$. That is, an action survives the $k$-th round of elimination only if it is a strict best response to some distribution over opponent strategies surviving the $(k-1)$-th round of elimination. Let
$$R^{\infty}_i[t_i] = \bigcap_{k=0}^{\infty} R^k_i[t_i]$$


be the set of player $i$ actions that survive every round of elimination. Define $t_{\theta^*}$ to be the type with common certainty in the true parameter $\theta^*$. I will say that action $a_i$ is strongly strict-rationalizable if $a_i \in R^{\infty}_i[t_{\theta^*}]$, where strongly is used to contrast with the definition below.

Notice that in this definition, every action that is never a strict best response (to surviving opponent strategies) is eliminated at once. This choice has consequences for the surviving set, since elimination of strategies that are never a strict best response is an order-dependent process. Next, I introduce a new procedure in which actions are eliminated (at most) one at a time.

For every agent $i$, let $W^1_i := A_i$. Then, for $k \geq 2$, recursively remove (at most) one action in $W^k_i$ that is not a strict best reply to any opponent strategy $\alpha_{-i}$ with support in $W^{k-1}_{-i}$. That is, either the set difference $W^k_i \setminus W^{k+1}_i$ is empty, or it consists of a singleton action $a_i$ for which there does not exist any $\alpha_{-i} \in \Delta\big(W^{k-1}_{-i}\big)$ such that
$$u_i(a_i, \alpha_{-i}) > u_i(a'_i, \alpha_{-i}) \quad \forall\, a'_i \neq a_i.$$
That is, $a_i$ is not a strict best reply to any distribution over surviving opponent actions. Let
$$W^{\infty}_i = \bigcap_{k \geq 1} W^k_i$$
be the set of player $i$ actions that survive every round of elimination, and say that any set $W^{\infty}_i$ constructed in this way survives an order of weak strict-rationalizability. Define $\mathcal{W}^{\infty}_i$ to be the intersection of all sets $W^{\infty}_i$ surviving an order of weak strict-rationalizability. I will say that an action $a_i$ is weakly strict-rationalizable if $a_i \in \mathcal{W}^{\infty}_i$.26
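The one-at-a-time procedure can be checked computationally in small games. The sketch below is a minimal illustration, not taken from the text: the function names, the grid approximation, and the example payoffs are all assumptions. It enumerates every maximal elimination order in a two-player game, testing "strict best reply to some distribution over surviving opponent actions" on a finite grid over the simplex. In the example, player 1's payoffs tie against one opponent action, so the elimination order matters and no player-1 action survives every order.

```python
import itertools
import numpy as np

def simplex_grid(m, K=50):
    """Distributions over m points with weights in multiples of 1/K
    (a finite grid standing in for the simplex)."""
    for comp in itertools.combinations_with_replacement(range(m), K):
        yield np.bincount(comp, minlength=m) / K

def is_strict_best_reply(U, ai, own_set, opp_set):
    """True if action ai is a strict best reply, within own_set, to SOME
    distribution over opp_set; vacuously True if ai is the only survivor.
    U[a, b] is this player's payoff from own action a against opponent b."""
    others = [a for a in own_set if a != ai]
    if not others:
        return True
    for alpha in simplex_grid(len(opp_set)):
        margins = [sum(alpha[j] * (U[ai, b] - U[ap, b])
                       for j, b in enumerate(opp_set)) for ap in others]
        if min(margins) > 1e-9:   # strictly better than every rival action
            return True
    return False

def surviving_sets(U1, U2, W1, W2):
    """All terminal (W1, W2) pairs reachable by removing, one at a time,
    some action that is not a strict best reply to any surviving belief."""
    removable = ([(0, a) for a in W1 if not is_strict_best_reply(U1, a, W1, W2)] +
                 [(1, b) for b in W2 if not is_strict_best_reply(U2, b, W2, W1)])
    if not removable:
        return {(frozenset(W1), frozenset(W2))}
    out = set()
    for player, x in removable:
        out |= surviving_sets(U1, U2,
                              [a for a in W1 if (player, a) != (0, x)],
                              [b for b in W2 if (player, b) != (1, x)])
    return out

# Player 1 ties against the second column; player 2 is indifferent everywhere.
U1 = np.array([[1.0, 1.0],
               [0.0, 1.0]])
U2 = np.zeros((2, 2))
results = surviving_sets(U1, U2, [0, 1], [0, 1])
print(sorted((sorted(a), sorted(b)) for a, b in results))
# -> [([0], [0]), ([0], [1]), ([1], [1])]
common = set.intersection(*[set(a) for a, _ in results])
print(sorted(common))   # player-1 actions surviving EVERY order -> []
```

Because the strict-best-reply test is approximated on a grid rather than solved as a linear program, the check is only reliable when, as here, the relevant margins are exactly zero or bounded away from zero.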

Theorem 2 Assume NI and R. Then, the rationalizability of action $a^*_i$ for agent $i$ is robust to inference if $a^*_i$ is strongly strict-rationalizable, and only if $a^*_i$ is weakly strict-rationalizable.

26 The choice of weak to describe the latter procedure, and strong to describe the former, is explained by Claim 6 (see Appendix B), which says that an action is strongly strict-rationalizable only if it is weakly strict-rationalizable.

Remark 5 If there are two players, then the theorem above can be strengthened as follows: Assume NI and R. Then, the rationalizability of action $a^*_i$ for agent $i$ is robust to inference if and only if $a^*_i$ is weakly strict-rationalizable.

Remark 6 The existence of actions that are strongly strict-rationalizable, but not weakly strict-rationalizable, occurs only for a non-generic set of payoffs.27 See the discussion preceding Figure 1.5 for a characterization of these intermediate cases.

Remark 7 Rationalizable actions that are robust to inference need not exist. For example, in the degenerate game

          a3      a4
   a1    0, 0    0, 0
   a2    0, 0    0, 0

all actions are rationalizable, but none are robust to inference.

Remark 8 Why is refinement obtained, in light of the results of Weinstein and Yildiz (2007)? The key intuition is that the negative result in Weinstein and Yildiz (2007) relies on the construction of tail beliefs that put sufficient probability on payoff functions with dominant actions. But under common inference, it is common certainty that every player puts low probability on “most” payoff functions. So, with high probability, contagion from “far-off” payoff functions with a dominant action cannot begin.

A second explanation for why refinement is obtained is the following. One can show that the perturbations considered in this paper are a subset of perturbations in the uniform-weak topology, which is finer than the topology used in Weinstein and Yildiz (2007). In particular, the sequences of types used to show failure of robustness in Weinstein and Yildiz (2007) do not converge in the uniform-weak topology.

27 The set of such payoffs is nowhere dense in the Euclidean topology on $U$.

The broad structure of the proof follows that of Theorem 1, with several new complications that I discuss below. Recall that as the quantity of data tends to infinity, agents have common certainty in a (shrinking) neighborhood of the true payoffs. Thus, $a^*_i$ is robust to inference if and only if common certainty in a sufficiently small neighborhood of the true payoffs $u^*$ implies that the action $a^*_i$ is rationalizable for player $i$.

A necessary condition for robustness to inference. In analogy with the set $U^{NE}_{a^*}$, define $U^R_{a^*_i}$ to be the set of all complete information games in which $a^*_i$ is rationalizable.28 Clearly, if $u^*$ is on the boundary of this set, then common certainty in a neighborhood of $u^*$ (no matter how small) cannot guarantee rationalizability of $a^*_i$. Therefore, a necessary condition for robustness to inference is that $u^*$ must lie in the interior of $U^R_{a^*_i}$. The first lemma says that the interior of $U^R_{a^*_i}$ is characterized by the set of actions that survive every process of weak strict-rationalizability.

Lemma 3 $u \in \mathrm{Int}\big(U^R_{a^*_i}\big)$ if and only if $a^*_i \in \mathcal{W}^{\infty}_i$ in the complete information game with payoffs $u$.

Why is $\mathrm{Int}\big(U^R_{a^*_i}\big)$ characterized by this particular notion of strict rationalizability, and not by others? I provide an example that illustrates why various other natural candidates are not the right notion, and follow this with a brief intuition for the proof of Lemma 3.

28 Here I abuse notation and write $U^R_{a^*_i}$ instead of $U^R_{i, a^*_i}$.

Consider the payoff matrices below:

  (u1)          a3      a4
         a1    1, 0    1, 0
         a2    0, 0    0, 0

  (u2)          a3      a4
         a1    1, 0    1, 0
         a2    0, 0    1, 0

If all strategies that are never a strict best reply are eliminated simultaneously (corresponding to strong strict-rationalizability), then a1 does not survive in either game.29 If the criterion is survival of any process of iterated elimination of strategies that are never a strict best reply, then a1 survives in both games.30 But u1 is in the interior of $U^R_{a_1}$, while u2 is not,31 so neither of these notions provides the desired differentiation.

Now, I provide a brief intuition for the “only-if” direction of Lemma 3. Suppose that action $a^*_i$ fails to survive some iteration of weak strict-rationalizability. Then, there is some sequence of sets $(W^k_i)_{k \geq 1}$ satisfying the recursive description in the

29 In the first round, both actions are eliminated for player 2, so a1 trivially cannot be a best reply for player 1 to any surviving player 2 action.

30 For example, the order of elimination

        a3   a4             a3              a3
   a1  1,0  1,0   →   a1   1,0   →   a1   1,0
   a2  0,0  0,0       a2   0,0

in the first game, and

        a3   a4             a3              a3
   a1  1,0  1,0   →   a1   1,0   →   a1   1,0
   a2  0,0  1,0       a2   0,0

in the second.

31 Action a1 remains rationalizable in every complete information game with payoffs close to u1, so $u_1 \in \mathrm{Int}(U^R_{a_1})$. In contrast, for arbitrary $\varepsilon > 0$, the payoff matrix

  (u′2)         a3        a4
         a1    1, 0      1, ε
         a2    0, 0    1+ε, ε

is within ε of u2 (in the sup-norm), but a1 is not rationalizable in the complete information game with payoffs u′2. So, the payoff u2 lies on the boundary of $U^R_{a_1}$.

definition of weak strict-rationalizability, such that $a^*_i \notin W^K_i$ for some $K < \infty$. To show that $a^*_i$ is not robust to inference, I construct a sequence of payoffs $u^n \to u$ with the property that $a^*_i$ fails to be rationalizable in the complete information game $u^n$, for $n$ sufficiently large. The key feature of this construction is the translation of weak dominance under the payoffs $u$ into strict dominance under the payoffs $u^n$. This is achieved by iteratively increasing by $\varepsilon$ the payoffs to every action that survives to $W^{k+1}_i$, thus breaking ties in accordance with the selection in $(W^k_i)_{k \geq 1}$.

So, a necessary condition for robustness to inference is weak strict-rationalizability. Next, I show that a sufficient condition is strong strict-rationalizability, and explain the gap between these two conditions.

A sufficient condition for robustness to inference. The reason why weak strict-rationalizability is not sufficient is that, unlike in the analogous case for equilibrium, common certainty in the set $U^R_{a^*_i}$ does not imply rationalizability of $a^*_i$.32 In fact, even if beliefs are concentrated on a (vanishingly) small neighborhood of a payoff function in $\mathrm{Int}(U^R_{a^*_i})$, it may be that $a^*_i$ fails to be rationalizable. See Appendix D for such an example.

Remark 9 This example shows moreover that weak strict-rationalizability is not lower hemi-continuous in the uniform-weak topology. Since strong strict-rationalizability is lower hemi-continuous in the uniform-weak topology (Dekel et al., 2006; Chen et al., 2010), this example suggests that subtleties in the definition of strict rationalizability have potentially large implications for robustness.

32 A simple example is the following. Consider the following two payoffs for agent 1:

  (u1)         a3     a4
         a1     1      0
         a2    3/4    3/4

  (u2)         a3     a4
         a1     0      1
         a2    3/4    3/4

Action a1 is rationalizable for agent 1 in both complete information games, so $u_1, u_2 \in U^R_{a_1}$. But action a1 is strictly dominated by action a2 if each game is equally likely, since in expectation payoffs are

               a3     a4
         a1   1/2    1/2
         a2   3/4    3/4

The reason why common certainty of a shrinking set in $U^R_{a^*_i}$ need not imply rationalizability of $a^*_i$ is that the chain of best responses rationalizing action $a^*_i$ can vary across $U^R_{a^*_i}$. In particular, it may be that the true payoffs $u^*$ lie on the boundary between two open sets of payoff functions, each with different families of rationalizable actions. See Figure 1.5 below for an illustration. These cases are problematic because even though $a^*_i$ is rationalizable when agents (truly) have common certainty in any payoff functions close to $u^*$, it may fail to be rationalizable if agents (mistakenly) believe that payoff functions on different sides of the boundary are common certainty.

On the other hand, if $a^*_i$ is strongly strict-rationalizable, then it can be justified by a chain of strict best responses that remain constant on some neighborhood of $u^*$. It can be shown in this case that common certainty in a vanishing neighborhood of $u^*$ indeed implies rationalizability of $a^*_i$. This provides the sufficient direction.

Figure 1.5: The set $U^R_{a^*_i}$ is partitioned such that every agent's set of rationalizable actions is constant across each element of the partition. There are three cases: (1) if $u^*$ is on the boundary of $U^R_{a^*_i}$ (e.g. $u_1$), then $a^*_i$ is not robust to inference; (2) if $u^*$ is in the interior of $U^R_{a^*_i}$, and moreover in the interior of a partition element ($u_2$), then $a^*_i$ is certainly robust to inference; (3) if $u^*$ is in the interior of $U^R_{a^*_i}$, but not in the interior of any partition element ($u_3$), then $a^*_i$ may not be robust to inference. See Appendix D for an example of the last case.

1.6 How Much Data do Agents Need?

Theorems 1 and 2 characterize the persistence of equilibria and rationalizable

actions given sufficiently large quantities of data. But in practice, the quantity of

data that agents observe about payoff-relevant parameters is limited. Robustness

to inference is meaningful only if convergence obtains in the ranges of data that

we can reasonably expect agents to observe. Therefore, I ask next, how much data

is needed for reasonable guarantees on persistence?

This section addresses this question by providing lower bounds for $p^R_n(i, a_i)$ and $p^{NE}_n(a)$. These bounds suggest a second, stronger criterion for equilibrium selection, based on the quantity of data needed to reach a desired threshold probability. These bounds also highlight the importance of various features of the solution and the game, including the degree of strictness of the solution and the complexity of the inference problem.

1.6.1 Bayesian Nash Equilibrium

The following is a measure for the “degree of strictness" of a Nash equilibrium in

the complete information game with payoffs u⇤= g(q⇤). For any d � 0, say that

action profile a is a d-strict Nash equilibrium33 if

u⇤i (ai, a�i)� max

a0i 6=aiu⇤

i (a0i, a�i) > d 8 i 2 I .

33Replacing the strict inequality > with a weak inequality �, this definition reverses the morefamiliar concept of e-equilibrium, which requires that

u⇤i (ai, a�i)� max

a0i 6=aiu⇤

i (a0i , a�i) � �e 8 i, where e � 0.

The concept of e-equilibrium was introduced to formalize a notion of approximate Nash equilibria(violating the equilibrium conditions by no more than e). I use d-strict equilibrium to provide acardinal measure for the strictness of a Nash equilibrium (satisfying the conditions with d to spare).

37

Page 50: Economic Theory and Statistical Learning

Every strict Nash equilibrium $a^*$ admits the following cardinal measure of strictness:
$$\delta^{NE}_a = \sup\,\{\delta : a \text{ is a } \delta\text{-strict NE}\},$$
which represents the largest $\delta$ for which $a$ is a $\delta$-strict NE. This parameter describes the amount of slack in the equilibrium $a$: the action profile $a$ remains an equilibrium on at least a $\delta^{NE}_a$-neighborhood of the payoff function $u^*$.
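In a finite two-player game, this measure reduces to a minimum of payoff slacks across players, which the sketch below computes directly. The helper name and the coordination-game payoffs are illustrative assumptions, not from the text.

```python
# Compute delta^NE for a pure profile in a two-player game. U1[i, j] and
# U2[i, j] are the players' payoffs when player 1 plays row i and player 2
# plays column j; the payoffs below are made up for the demo.
import numpy as np

def delta_ne(U1, U2, profile):
    i, j = profile
    # Each player's slack: payoff at the profile minus the best deviation.
    slack1 = U1[i, j] - max(U1[k, j] for k in range(U1.shape[0]) if k != i)
    slack2 = U2[i, j] - max(U2[i, k] for k in range(U2.shape[1]) if k != j)
    return min(slack1, slack2)   # sup{delta : profile is delta-strict}

U1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])     # a coordination game; both players share U1
U2 = U1.copy()
print(delta_ne(U1, U2, (0, 0)))  # 2.0 -- the "stricter" equilibrium
print(delta_ne(U1, U2, (1, 1)))  # 1.0
print(delta_ne(U1, U2, (0, 1)))  # -2.0 -- not a Nash equilibrium
```

A profile that is not a strict Nash equilibrium yields a negative value, so the measure is informative only for strict equilibria.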

Proposition 2 Suppose $a^*$ is a $\delta$-strict Nash equilibrium for some $\delta > 0$. Then, for every $n \geq 1$,
$$p^{NE}_n(a^*) \;\geq\; 1 - \frac{2}{\delta^{NE}_{a^*}}\, \mathbb{E}_{\mathbb{P}^n}\!\left( \sup_{\mu \in \mathcal{M}} \big\| h(\mu_{Z^n}) - u^* \big\|_{\infty} \right) \tag{1.5}$$
where $h(\nu) = \int_{\Theta} g(\theta)\,d\nu$ for every $\nu \in \Delta(\Theta)$.

Remark 10 Uniform consistency of $\mathcal{M}$ implies that $\sup_{\mu \in \mathcal{M}} \| h(\mu_{Z^n}) - u^* \|_{\infty} \to 0$ a.s., so for any strict Nash equilibrium $a^*$, the bound in Proposition 2 converges to 1.34 This implies also that the gap between $p^{NE}_n(a^*)$ and its lower bound in (1.5) converges to 0 as the quantity of data $n$ tends to infinity.
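To see the bound in action, the sketch below simulates a toy version of (1.5): the parameter is a Bernoulli mean $\theta^* = 0.6$, $g$ is the identity (so $h(\mu)$ is just the mean of $\mu$), and the family consists of three Beta-posterior-mean rules. All of these specifics, including the strictness value $\delta = 0.5$, are illustrative assumptions rather than objects from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 0.6          # true parameter; payoffs are g(theta) = theta
delta_star = 0.5          # assumed strictness delta^NE of the equilibrium
priors = [(1, 1), (2, 2), (5, 1)]   # Beta(a, b) priors defining the family M

def expected_sup_error(n, reps=2000):
    """Monte Carlo estimate of E[ sup_mu |h(mu_Zn) - u*| ] when each rule
    reports the posterior mean of a Beta prior after n Bernoulli draws."""
    successes = rng.binomial(n, theta_star, size=reps)
    sup_err = np.zeros(reps)
    for a, b in priors:
        post_mean = (successes + a) / (n + a + b)
        sup_err = np.maximum(sup_err, np.abs(post_mean - theta_star))
    return float(sup_err.mean())

for n in [10, 100, 1000]:
    bound = 1 - (2 / delta_star) * expected_sup_error(n)
    print(n, round(bound, 3))
```

With little data the bound is loose; as $n$ grows the expected supremum error vanishes and the bound approaches 1, consistent with Remark 10.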

How can we interpret this bound? By assumption, action profile $a^*$ is an equilibrium in the complete information game with payoffs $u^*$. But when $n < \infty$, agents may have heterogeneous and incorrect beliefs. The probability with which $a^*$ persists as an equilibrium under these modified beliefs is determined by two components:
$$1 - \underbrace{\frac{2}{\delta^{NE}_{a^*}}}_{(1)}\; \underbrace{\mathbb{E}_{\mathbb{P}^n}\!\left( \sup_{\mu \in \mathcal{M}} \big\| h(\mu_{Z^n}) - u^* \big\|_{\infty} \right)}_{(2)}.$$
First, it depends on the fragility of the solution $a^*$ to the introduction of heterogeneity and error in beliefs. This is reflected in component (1): the bound is increasing

34 This follows from continuity of the map $h$ (see Lemma 4).

in the parameter $\delta^{NE}_{a^*}$. Intuitively, equilibria that are “stricter” persist on a larger neighborhood of the true payoffs $u^*$. It turns out that common certainty in the $\delta^{NE}_{a^*}/2$-neighborhood of $u^*$ is sufficient to imply that $a^*$ is an equilibrium (see Lemma 11).

Second, the probability $p^{NE}_n(a^*)$ depends on the expected error in beliefs. This is reflected in the second component: $\| h(\mu_{Z^n}) - u^* \|_{\infty}$ is the (random) error in estimated payoffs using a fixed inference rule $\mu \in \mathcal{M}$; $\sup_{\mu \in \mathcal{M}} \| h(\mu_{Z^n}) - u^* \|_{\infty}$ is the (random) supremum error across inference rules in $\mathcal{M}$; and (2) gives the expected supremum error across inference rules in $\mathcal{M}$. As $n$ tends to infinity, this quantity tends to zero,35 but the speed at which inference rules in $\mathcal{M}$ uniformly converge to the truth is determined by the “diversity” of inference rules in $\mathcal{M}$, and by the statistical complexity of the learning problem.

The first feature, diversity, can be thought of as a property of the relationship of inference rules in $\mathcal{M}$ to each other. Holding fixed the rate at which individual inference rules learn, the lower bound is lower when inference rules in $\mathcal{M}$ jointly learn more slowly. How this occurs, and how much effect it can have on the analyst's confidence $p^{NE}_n(a^*)$, is discussed in detail in Section 6.3.

The second feature, complexity, can be thought of as a property of the relationship between inference rules in $\mathcal{M}$ and the data. For example, in Section 2, the probability $p(n, r)$ decreases in the dimensionality of the data. More generally, when finite-sample bounds for the uniform rate of convergence of inference rules in $\mathcal{M}$ are available, they can be plugged into the lower bound in Proposition 2. This technique is illustrated below for a new set of inference rules $\mathcal{M}$.

Example 6 Let us consider agents who use ordinary least-squares regression to estimate a relationship between $p$ covariates and a real-valued outcome variable. An observation is a tuple $(x_i, y_i) \in Z := \mathbb{R}^p \times \mathbb{R}$, where
$$y_i = x_i^T \beta + \epsilon_i$$
with $x_i \sim_{\text{i.i.d.}} \mathcal{N}(0, I_p)$, $\epsilon_i \sim_{\text{i.i.d.}} \mathcal{N}(0, 1)$, and $x_i$ and $\epsilon_i$ independent. Suppose that the first coordinate of the coefficient vector $\beta$, denoted $\beta_1$, is payoff-relevant. That is, $\Theta = \mathbb{R}$, and the true parameter is $\theta^* = \beta_1$.

Recall that the least-squares estimate for the coefficient vector $\beta$ is
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
where $X$ is the matrix whose $i$-th row is given by $x_i$, and $Y$ is the vector whose $i$-th entry is given by $y_i$. Fix a sequence of constants $\phi_n$ that tends to 0. Let $\mathcal{M}$ consist of the set of inference rules that map the data into a distribution with support in $B_{\phi_n}(\hat{\beta}_1)$. That is, every inference rule maps the data into a distribution with support in the $\phi_n$-neighborhood of the least-squares estimate for $\beta_1$.

35 Since $\mathcal{M}$ is $\theta^*$-uniformly consistent and $h$ is continuous.

Corollary 1 Suppose the data-generating process and family of inference rules are as described in the above example. Then, for every complete information game and $\delta$-strict Nash equilibrium $a^*$ (with $\delta > 0$),

$$p^{NE}_n(a^*) \;\geq\; 1 - \frac{2K}{\delta^{NE}_{a^*}} \left( \sigma^2\left( \sqrt{\frac{p}{n}} + \frac{p}{n} \right) + \phi_n^2 \right)$$

where $K$ is the Lipschitz constant36 of the map $g : \Theta \to U$.

Thus, the lower bound is decreasing in the number of covariates $p$ (i.e. the analyst is less confident in predicting $a^*$ when the number of covariates is larger). The proof can be found in the appendix.

36 Assuming the sup-norm on $U$ and the Euclidean norm on $\Theta$.
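The qualitative content of the corollary (more covariates, slower learning of $\beta_1$, weaker bound) can be checked by simulation. The sketch below draws data as in Example 6 and reports the average error of the least-squares estimate of $\beta_1$ at a fixed $n = 100$; the sample size, repetition count, and choice of $\beta$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_abs_error_beta1(n, p, reps=500):
    """Average |beta1_hat - beta1| for OLS with p i.i.d. N(0,1) covariates
    and N(0,1) noise, as in the setup of Example 6."""
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        beta = np.zeros(p)
        beta[0] = 1.0                      # theta* = beta_1 = 1
        y = X @ beta + rng.standard_normal(n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        errs.append(abs(beta_hat[0] - 1.0))
    return float(np.mean(errs))

for p in [1, 20, 60]:
    print(p, round(mean_abs_error_beta1(100, p), 3))
```

The average error (and hence the analyst's loss of confidence) grows with $p$ at fixed $n$, as the corollary indicates.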


1.6.2 Rationalizable Actions

We can now repeat the previous exercise for the solution concept of rationalizability. The following is a measure of the “degree” of rationalizability of an action $a_i$ in the complete information game with payoffs $u^*$. For any $\delta \geq 0$, say that the family of sets $(R_j)_{j \in I}$ is closed under $\delta$-best reply if for every agent $j$ and action $a_j \in R_j$, there is some distribution $\alpha_{-j} \in \Delta(R_{-j})$ such that
$$u^*_j(a_j, \alpha_{-j}) > u^*_j(a'_j, \alpha_{-j}) + \delta \quad \forall\, a'_j \neq a_j. \tag{1.6}$$
Say that action $a_i$ is $\delta$-strict rationalizable for agent $i$ if there exists some family $(R_j)_{j \in I}$, with $a_i \in R_i$, that is closed under $\delta$-best reply. Every strictly rationalizable action $a_i$ admits the following cardinal measure of the degree of strictness:
$$\delta^R_{a_i} = \sup\,\{\delta : a_i \text{ is } \delta\text{-strict rationalizable}\}.37$$
This parameter describes the amount of slack in the rationalizability of action $a_i$: the action $a_i$ remains rationalizable for agent $i$ on at least a $\delta^R_{a_i}$-neighborhood of the payoff function $u^*$.

Remark 11 This definition is equivalent to requiring that $a_i$ survive a more general version of strong strict-rationalizability, where the inequality in (1.4) is replaced by
$$\int_{\Theta \times T_{-i} \times A_{-i}} \left[ u_i(a_i, a_{-i}, \theta) - u_i(a'_i, a_{-i}, \theta) \right] d\pi > \delta \quad \forall\, a'_i \neq a_i,$$
so that $a_i$ yields at least $\delta$ more than the next best action given the distribution $\pi$.38

37 I abuse notation here and write $\delta^R_{a_i}$ instead of $\delta^R_{i, a_i}$. Again, this parameter is defined only if $a_i$ is $\delta$-strict rationalizable for some $\delta \geq 0$.

38 A similar procedure is introduced in Dekel et al. (2006). The above definition makes the following modifications: first, the inequality is strict; second, $\delta$ appears on the right-hand side of the inequality, instead of $-\delta$.

Proposition 3 Suppose action $a^*_i$ is $\delta$-strict rationalizable for some $\delta > 0$. Then, for every $n \geq 1$,
$$p^R_n(i, a^*_i) \;\geq\; 1 - \frac{2}{\delta^R_{a^*_i}}\, \mathbb{E}_{\mathbb{P}^n}\!\left( \sup_{\mu \in \mathcal{M}} \big\| h(\mu_{Z^n}) - u^* \big\|_{\infty} \right),$$
where $h(\nu) = \int_{\Theta} g(\theta)\,d\nu$ for any $\nu \in \Delta(\Theta)$.

Proof 6 See appendix.

Again, we see that the lower bound is increasing in the “strictness” of the solution, as measured through the parameter $\delta^R_{a^*_i}$, and in the speed at which expected payoffs using inference rules in $\mathcal{M}$ uniformly converge to the true payoffs $u^*$. As before, when finite-sample bounds are available, they can be used to derive closed-form expressions for this bound.

Corollary 2 Suppose the data-generating process and family of inference rules are as described in Example 6. Then, for every complete information game, agent $i$, and $\delta$-strict rationalizable action $a^*_i$ (with $\delta > 0$),

$$p^R_n(i, a^*_i) \;\geq\; 1 - \frac{2K}{\delta^R_{a^*_i}} \left( \sigma^2\left( \sqrt{\frac{p}{n}} + \frac{p}{n} \right) + \phi_n^2 \right)$$
where $K$ is the Lipschitz constant39 of the map $g : \Theta \to U$.

1.6.3 Diversity across Inference Rules in M

I conclude this section with a brief discussion of the dependence of $p^{NE}_n(a)$ and $p^R_n(i, a_i)$ on the diversity across inference rules in $\mathcal{M}$. To isolate this effect from properties of individual inference rules, let us fix the marginal distributions of $\mu(Z^n)$ for every $\mu \in \mathcal{M}$, and vary the joint distribution of the random variables $(\mu(Z^n))_{\mu \in \mathcal{M}}$. Proposition 4 below provides upper and lower bounds for $p^{NE}_n(a)$ and $p^R_n(i, a_i)$. These bounds can be understood from the following simple example.

39 Assuming the sup-norm on $U$ and the Euclidean norm on $\Theta$.

Example 7 Recall the game from Section 2 with payoffs

                 A             NA
   A           θ, θ         0, 1/2
   NA         1/2, 0       1/2, 1/2

where $\Theta = \{-1, 1\}$. Fix a quantity of data $n < \infty$, and suppose that $\mathcal{M}$ consists of two inference rules $\mu_1, \mu_2$ with marginal distributions
$$\mu_1(Z^n) \sim \tfrac{1}{4}\,\delta_{-1} + \tfrac{3}{4}\,\delta_1 \qquad\qquad \mu_2(Z^n) \sim \tfrac{3}{4}\,\delta_{-1} + \tfrac{1}{4}\,\delta_1$$
That is, with probability 1/4, data $z^n$ is generated such that $\mu_1(z^n)$ is degenerate on $-1$, and with probability 3/4, data is generated such that $\mu_1(z^n)$ is degenerate on 1. (The distribution of $\mu_2(Z^n)$ is interpreted similarly.) Given these distributions, what are the largest and smallest possible values of $p^{NE}_n((A, A))$?

First observe that action profile $(A, A)$ is an equilibrium if and only if data $z^n$ is realized such that $\mu_1(z^n) = \mu_2(z^n) = \delta_1$. Otherwise, $A$ is strictly dominated for the agent with first-order belief $\delta_{-1}$. At one extreme, $\mu_1(Z^n)$ and $\mu_2(Z^n)$ may be correlated such that $\mu_1(z^n) = \delta_1$ for every dataset $z^n$ where $\mu_2(z^n) = \delta_1$. Then,
$$p^{NE}_n(a) = \Pr(\{z^n : \mu_2(z^n) = \delta_1\}) = \frac{1}{4}.$$
If instead $\mu_1$ and $\mu_2$ are independent, then
$$p^{NE}_n(a) = \Pr(\{z^n : \mu_1(z^n) = \delta_1\})\,\Pr(\{z^n : \mu_2(z^n) = \delta_1\}) = \left(\frac{3}{4}\right)\left(\frac{1}{4}\right) < \frac{1}{4}.$$
This quantity is further reduced if $\mu_2(z^n) = \delta_1$ implies that $\mu_1(z^n) = \delta_{-1}$, in which case $p^{NE}_n(a) = 0$.
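The three joint distributions considered in the example are the independent coupling and the two Fréchet–Hoeffding extremes for the given marginals, and the resulting values can be tabulated directly. This is a sketch of the example's arithmetic; the variable names are illustrative.

```python
# Pr(mu_i(Z^n) = delta_1) under the two marginals of Example 7.
p1 = 3 / 4
p2 = 1 / 4

# (A, A) survives only on datasets where BOTH rules output delta_1, so its
# probability ranges over the Frechet-Hoeffding interval for these marginals.
p_comonotone = min(p1, p2)                  # rules err on the same datasets
p_independent = p1 * p2                     # rules err independently
p_countermonotone = max(p1 + p2 - 1, 0.0)   # rules err on disjoint datasets
print(p_comonotone, p_independent, p_countermonotone)   # 0.25 0.1875 0.0
```

The comonotone value 1/4, the independent value 3/16, and the countermonotone value 0 match the three cases worked out above.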

These observations can be generalized as follows for arbitrary finite $\mathcal{M}$. For every inference rule $\mu$ and quantity of data $n \geq 1$, define
$$p^{NE}_{\mu,n}(a) := \Pr\big( h(\mu_{Z^n}) \in U^{NE}_a \big). \tag{1.7}$$
This is the probability that action profile $a$ is a Nash equilibrium if every agent has beliefs degenerate on the prediction of inference rule $\mu$. Define $p^R_{\mu,n}(i, a_i)$ analogously, replacing $U^{NE}_a$ with $U^R_{a_i}$ in (1.7).

Proposition 4 Suppose $\mathcal{M}$ is finite, and the marginal distributions $(\mu(Z^n))_{\mu \in \mathcal{M}}$ are fixed. Then,
$$1 - \sum_{\mu \in \mathcal{M}} \big(1 - p^{NE}_{\mu,n}(a)\big) \;\leq\; p^{NE}_n(a) \;\leq\; \min_{\mu \in \mathcal{M}} p^{NE}_{\mu,n}(a)$$
and
$$1 - \sum_{\mu \in \mathcal{M}} \big(1 - p^R_{\mu,n}(i, a_i)\big) \;\leq\; p^R_n(i, a_i) \;\leq\; \min_{\mu \in \mathcal{M}} p^R_{\mu,n}(i, a_i).$$

The upper bound corresponds to co-monotonic random variables, and the lower bound, when attainable, corresponds to counter-monotonic random variables. In the co-monotonic case, different inference rules err in their inference of payoffs on the same sets of data, whereas in the counter-monotonic case they err on datasets that are as non-overlapping as possible.

1.7 Extensions

The following section provides brief comments on, and extensions to, various modeling choices made in the main framework.

1.7.1 Misspecification

Proposition 1 shows that $\theta^*$-uniform consistency is both necessary and sufficient for common learning, and I assume in the remainder of the paper that the family of inference rules $\mathcal{M}$ is $\theta^*$-uniformly consistent. But continuity of equilibrium sets (and rationalizable sets) does not require common learning. Can we obtain Theorems 1 and 2 under a weakening of this property?

In fact, it is neither necessary that individual inference rules be consistent, nor necessary that inference rules converge uniformly. I introduce a relaxation of uniform consistency below.

Definition 5 (Almost $\theta^*$-uniform consistency.) For any $\epsilon \geq 0$, say that the class of inference rules $\mathcal{M}$ is $(\epsilon, \theta^*)$-uniformly consistent if
$$\lim_{n \to \infty}\, \sup_{\mu \in \mathcal{M}} d_P\big(\mu(Z^n), \delta_{\theta^*}\big) \leq \epsilon \quad \text{a.s.}$$
where $d_P$ is the Prokhorov metric on $\Delta(\Theta)$.

This says that a class of inference rules is almost $\theta^*$-uniformly consistent if the set of plausible first-order beliefs converges40 almost surely to a neighborhood of the true parameter. Notice that uniform consistency is nested as the $\epsilon = 0$ case. The proofs of Theorems 1 and 2 are easily adapted to show the following result. (In reading this, recall that if $\mathcal{M}$ is $(\epsilon, \theta^*)$-uniformly consistent, then it is also $(\epsilon', \theta^*)$-uniformly consistent for every $\epsilon' > \epsilon$.)

Proposition 5 Assume NI and R.

1. The rationalizability of action $a^*_i$ is robust to inference if $\delta^R_{a^*_i} > 0$ and $\mathcal{M}$ is $\big(\delta^R_{a^*_i}, \theta^*\big)$-uniformly consistent.

2. The equilibrium property of $a^*$ is robust to inference if $\delta^{NE}_{a^*} > 0$ and $\mathcal{M}$ is $\big(\delta^{NE}_{a^*}, \theta^*\big)$-uniformly consistent.

40 In the Hausdorff distance induced by $d_P$.

1.7.2 Private Data

In the main text, I assume that agents observe a common dataset. How do the main

results change if agents observe private data? Cripps et al. (2008) have shown that

if Z is unrestricted, then common learning may not occur even if |M| = 1 (so that

M contains a single inference rule). It is also known that strict Nash equilibria need

not be robust to higher-order uncertainty about opponent data (see e.g. Carlsson

and van Damme (1993), Kajii and Morris (1997)). Thus, extension to private data

requires restrictions on beliefs over opponent data that are beyond the scope of this

paper.

In the simplest extension, however, we may suppose that players observe different datasets $(z_i)_{i \in I}$, independently drawn from the same distribution, but each has an (incorrect) degenerate belief that all opponents have seen the same data that he has. Then, Theorems 1 and 2 hold as stated, and the bounds in Propositions 2 and 3 are revised as follows.

2 and 3 are revised as follows.

Proposition 6 Suppose $a^*$ is a $\delta$-strict Nash equilibrium for some $\delta > 0$. Then, for every $n \geq 1$,
$$p^{NE}_n(a^*) \;\geq\; \left( 1 - \frac{2}{\delta^{NE}_{a^*}}\, \mathbb{E}_{\mathbb{P}^n}\!\left( \sup_{\mu \in \mathcal{M}} \big\| h(\mu_{Z^n}) - u^* \big\|_{\infty} \right) \right)^{\! I}$$
where $I$ is the number of players. Suppose $a^*_i$ is $\delta$-strict rationalizable for some $\delta > 0$. Then, for every $n \geq 1$,
$$p^R_n(i, a^*_i) \;\geq\; \left( 1 - \frac{2}{\delta^R_{a^*_i}}\, \mathbb{E}_{\mathbb{P}^n}\!\left( \sup_{\mu \in \mathcal{M}} \big\| h(\mu_{Z^n}) - u^* \big\|_{\infty} \right) \right)^{\! I}.$$


1.7.3 Limit Uncertainty

In the main text, I assume that agents learn the true parameter as the quantity of data $n$ tends to infinity, so that the limit game is a complete information game. This approach can be extended so that the limit game has incomplete information. Fix a distribution $\nu \in \Delta(\Theta)$ (a limit common prior) and rewrite uniform consistency as follows:

Definition 6 (Limit Common Prior.) The set of inference rules $\mathcal{M}$ has a limit common prior $\nu$ if
$$\sup_{\mu \in \mathcal{M}} d_P\big(\mu(Z^n), \nu\big) \to 0 \quad \text{a.s.}$$
where $d_P$ is the Prokhorov metric on $\Delta(\Theta)$.

Then, taking $u^* := h(\nu)$ to be the expected payoff function under $\nu$, all the results in Section 5 follow without modification.

1.8 Related Literature

This paper makes a connection between the literature regarding robustness of

equilibrium to specification of agent beliefs, and the literature that studies agents

who learn from data. I discuss each of these literatures in turn.

1.8.1 Robustness of equilibrium and equilibrium refinements

The following question has been the focus of an extensive literature: Suppose an analyst does not know the exact game that is being played. Which solutions in his model of the game can be guaranteed to be close to some solution in all nearby games?


Early work on this question considered “nearby" to mean complete information

games with close payoffs (Selten, 1975; Myerson, 1978; Kohlberg and Mertens, 1986).

Fudenberg et al. (1988) proposed consideration of nearby games in which players

themselves have uncertainty about the true game. This approach of embedding

a complete information game into games with incomplete information has since

been taken in several papers under different assumptions on beliefs. For example:

Carlsson and van Damme (1993) consider a class of incomplete information games

in which beliefs are generated by (correlated) observations of a noisy signal of

payoffs of the game. Kajii and Morris (1997) study incomplete information games in

which beliefs are induced by general information structures that place sufficiently

high ex-ante probability on the true payoffs.

I ask which solutions of a complete information game persist in nearby incomplete information games, where the definition of nearby that I use differs from the existing literature in the following ways: First, I place a strong restriction on (interim) higher-order beliefs, which has the consequence that agents commonly learn the true parameter. This contrasts with Carlsson and van Damme (1993) and Kajii and Morris (1997), in which (even as perturbations become vanishingly small) agents consider it possible that other agents have beliefs about the unknown parameter that are very different from their own. In particular, failures of robustness due to standard contagion arguments do not apply in my setting; thus, I obtain rather different robustness results.41

Second, while the restriction I place on interim beliefs is stronger in the sense described above, I do not require that these beliefs are consistent with a common prior. This allows for common knowledge disagreement, which is not permitted in either Carlsson and van Damme (1993) or Kajii and Morris (1997).

41 For example, the construction of beliefs used in Weinstein and Yildiz (2007) to show failure of robustness (Proposition 2) relies on the construction of tail beliefs that place positive probability on an opponent having a first-order belief that implies a dominant action. A similar device is employed in Kajii and Morris (1997) to show that robust equilibria need not exist (see the negative example in Section 3.1). These tail beliefs are not permitted under my approach. When the quantity of data is taken to be sufficiently large, it is common certainty (with high probability) that all players have first-order beliefs close to the true distribution, so the process of contagion cannot begin.

Finally, the class of perturbations that I consider is motivated by a learning

foundation (this aspect shares features with Dekel et al. (2004) and Esponda (2013),

but agents in this paper learn about payoffs only, and not actions). I interpret the

sequence of interim types as corresponding to learning from a fixed number of

observations. This motivates a departure from the literature in studying solution

sets not just in nearby games (large n), but also in far games (small n). In particular,

I suggest that we can characterize the degree of robustness by looking at the

persistence of solutions in small-n games.

1.8.2 Role of higher-order beliefs

A related literature studies the sensitivity of solutions to specification of higher-order beliefs. Early papers in this literature (Mertens and Zamir, 1985; Brandenburger and Dekel, 1993) considered types to be nearby if their beliefs were close up

to order k for large k (corresponding to the product topology on types). Several authors have shown that this notion of close leads to surprising and counterintuitive

conclusions, in particular that strict equilibria and strictly rationalizable actions are

fragile to perturbations in beliefs (Rubinstein, 1989; Weinstein and Yildiz, 2007).

These findings have motivated new definitions of “nearby" types. Dekel et al.

(2006) characterize the coarsest metric topology on types under which the desired

continuity properties hold. This topology is defined via strategic properties of

types, instead of directly on beliefs. Chen et al. (2010) subsequently developed a

(finer) metric topology on types—the uniform-weak topology—which is defined

explicitly using properties of beliefs. In this topology, two types are considered


close if they have similar first-order beliefs, attach similar probabilities to other

players having similar first-order beliefs, and so forth.

The perturbations in beliefs that I allow for are perturbations in the uniform-weak topology. Specifically, the type spaces that I look at—that is, all type profiles

with common certainty in the predictions of a set of inference rules M—converge in

this topology to the singleton type space containing the type with common certainty

in the true parameter. Thus, robustness to inference can be interpreted as requiring

persistence across a subset of perturbations in the uniform-weak topology.42 A

related approach is taken in Morris et al. (2012) and Morris and Takahashi, where

approximate common certainty in the true parameter is considered, instead of

common certainty in a neighborhood of the true parameter.

1.8.3 Agents who learn from data

The set of papers including Gilboa and Schmeidler (2003), Billot et al. (2005), Gilboa

et al. (2006), Gayer et al. (2007), and Gilboa et al. (2013) propose an inductive or

case-based approach to modeling economic decision-making. The present paper

can be interpreted as studying the strategic behaviors of case-based learners when

there is uncertainty over the inductive inference rules used by other agents.

There is also a body of work that studies asymptotic disagreement between

agents who learn from data. Cripps et al. (2008) study agents who use the same

Bayesian inference rule but observe different (private) sequences of data; Al-Najjar

(2009) studies agents who use different frequentist rules to learn from data; and

Acemoglu et al. (2015) study Bayesian agents who have different priors over the

signal-generating distribution. My model of belief formation shares many features

42The characterizations of robustness in this paper are possibly unchanged if agents have common p-belief in the predictions of inference rules in M, where p → 1 as the quantity of data tends to infinity. I leave verification of this for future work.


with these inference rules, but the main object of study is the convergence of

equilibrium sets, instead of the convergence of beliefs.

Finally, Steiner and Stewart (2008) study the limiting equilibria of a sequence of

games in which agents use a kernel density estimator to infer payoffs from related

games. This paper is conceptually very close, but there are several important

differences in the approach. For example, Steiner and Stewart (2008) suppose that

agents share a common inference rule and observe endogenous data (generated by

past, strategic actors), while I suppose that agents have different inference rules

and observe exogenous data. Additionally, the (common) inference rule in Steiner

and Stewart (2008) is not indexed by the quantity of data, so the limit of their

learning process is a game with heterogeneous beliefs, whereas the limit of my

process is a game with common certainty of the true distribution.

1.8.4 Model uncertainty

Consideration of model uncertainty in game theory is largely new, but similar

ideas have been advanced in several neighboring areas of economics. Eyster and

Piccione (2013) study an asset-pricing inference rule in which agents have different

incomplete theories of price formation. The set of papers including Hansen and

Sargent (2007), Hansen and Sargent (2010), Hansen and Sargent (2012), and Hansen

(2014), among others, consider the implications of model uncertainty for various

questions in macroeconomics. In their framework, a decision-maker considers

a set of models (prior distributions) plausible, and uses a max-min criterion for

decision-making.


1.8.5 Epistemic game theory

I extensively use tools, results, and concepts from various papers in epistemic

game theory, including Monderer and Samet (1989), Brandenburger and Dekel

(1993), Morris et al. (1995), Dekel et al. (2007), and Chen et al. (2010). The notion of

common certainty in a set of first-order beliefs was studied earlier in Battigalli and

Siniscalchi (2003).

1.9 Discussion

Directions for future work include the following:

Endogenous data. In this paper, data is generated according to an exogenous

distribution. An important next step is to consider data generated by actions

played by past strategic actors. In this dynamic setting, past actions play a role in

coordinating future beliefs via the kind and quantity of data generated.

Optimal informational complexity design. Suppose a designer has control over

the complexity of information disclosed to agents in a strategic setting. Using the

approach developed in this paper, the designer’s choice of complexity influences

the commonality in beliefs across agents. When will he choose to disclose simpler

information, and when will he disclose information that is more complex? If the

designer’s interests are opposed to those of the agents, should a social planner

regulate the kind of information he can provide?

Confidence in predictions. An action profile is usually thought of as having the

binary quality of either being, or not being, a solution. The approach in this paper

may provide a way to qualify such statements with a level of confidence. In this

paper, pn(a) describes the analyst’s confidence in predicting a given n observations.

I hope to extend these ideas towards construction of a cardinal measure for the


strength of equilibrium predictions across different games.


Chapter 2

The Theory is Predictive, but is it Complete? An Application to Human Perception of Randomness1

2.1 Introduction

When we test theories, it is common to focus on what one might call correctness:

do the predictions of the theory match what we see in the data? For example, we

can test a theory that says that wages are determined by one’s knowledge and

capabilities, by looking at whether more education indeed predicts higher wages

in labor data. Such a finding would provide evidence in support of the theory, but

little guidance towards whether an alternative theory might fit the data even better.

1Co-authored with Jon Kleinberg and Sendhil Mullainathan


Beyond correctness, we also care about this latter feature, which we will refer to as

completeness: how much of the explainable variation in the data is captured by the

theory?

Measurement of the completeness of our theories is important because it

provides guidance on the marginal (predictive) gain of improvement in modeling.

If our models are fairly complete, then new theories should not be expected

to drastically improve predictive power (they may, of course, aid towards other

goals—for example, by providing new conceptual insight into the problem). In

contrast, if our models are far from complete, then new models have the potential

to lead to large improvements in prediction. Completeness therefore guides both

our understanding of the achievements already made by existing models, and also

of the potential progress that remains ahead.

Despite an interest in completeness, we focus on correctness in social science

for a pragmatic reason. We can measure the fit of any given theory to data, but

we have no intuition for what constitutes a “good" fit. For example, suppose

we are interested in predicting a binary variable and find that a given theory

predicts accurately in 55% of observed trials. Is this achievement significant? For

certain problems—e.g. prediction of changes in stock returns given a past history

of returns—55% accuracy is a stunning success. In other problems—e.g. prediction

of college matriculation given socioeconomic and other personal characteristics—it

is only mediocre. This significant variation in predictability across problems means

that perfect accuracy is not a universally appropriate benchmark for our theories:

we need to understand how well a theory’s predictive power lines up against some

best achievable accuracy.

The purpose of this paper is to propose a practically implementable way to generate this benchmark, via methods in machine learning. As we discuss more fully


in Section 5, this is an approach that is also being proposed contemporaneously

with our work by Peysakhovich and Naecker (2016). Recent advances in machine

learning have enabled substantial progress in problems of prediction, but are often

criticized for using atheoretical and uninterpretable models, for example by searching for the best prediction function over a large set of explanatory variables. The

resulting prediction functions perform well empirically but rarely reveal a deep

theoretical structure.

Rather than considering machine learning as a replacement for existing theories,

a role for which it (currently) seems ill-suited, our goal is to leverage its techniques

towards the alternative goal of assessing theory completeness. The approach

we propose is simple: compare the performance of existing (interpretable and

economically meaningful) models to the performance of atheoretical machine

learning algorithms.

We illustrate this approach on a simple problem with a long history of study in

psychology and behavioral economics: human generation of random sequences. It

is well documented that humans misperceive randomness (Bar-Hillel & Wagenaar,

1991; Kahneman & Tversky 1972), with implications in many economic settings

(Rabin and Vayanos, 2010; Chen et al.). Leading models for human misperception of

randomness include Rabin (2002) and Rabin and Vayanos (2010). We are interested

in assessing the success of these models towards predicting human generation of

fair coin flips.

To this end, we use the platform Mechanical Turk to collect 14,050 strings of

length eight, produced as if by flipping a fair coin several times in succession. We

ask two questions. First: can we predict the eighth flip in a string if we are given

the first seven? Second: can we separate human-generated strings from strings

generated by a true Bernoulli(0.5) process in a mixed sample?


We adapt Rabin (2002) and Rabin and Vayanos (2010) for these prediction

problems, and find that these models achieve (mean-squared) prediction errors

of approximately 0.249. These prediction errors are improvements upon a naive

baseline of 0.25 (corresponding to guessing at random). The problem of interest

regards the interpretation of the improvement of approximately 0.001 on the naive

baseline. How significant is this reduction in prediction error, and how much could

we hope to improve upon it? To answer these questions, we need a benchmark for

achievable prediction error.

There are several ways to construct such a benchmark using machine learning

prediction techniques. For our first set of results, we draw on an appealing

property of our domain: despite its conceptual richness, it has a compact enough

representation that we can construct an essentially “perfect” benchmark based

on table lookup, an algorithm that uses the empirical distribution of the full set of

combinatorially distinct strings in a training set to predict new strings. Using this

approach, we achieve prediction errors of approximately 0.243. If we take this to

be our benchmark, then existing behavioral models produce roughly 12% of the

achievable improvement in prediction error for this problem.

In the remainder of the paper, we outline and respond to two potential critiques

of this method for measuring completeness. The first is feasibility. Table lookup

can be implemented in the given problem because of the size of the domain space,

but is not a generally viable strategy. How sensitive is the estimated measure of

completeness to the choice of machine learning algorithm? Towards this concern,

we report the prediction error achieved by two alternative (standard) machine

learning approaches: LASSO regression and a decision tree algorithm. The best

of these algorithms achieves a prediction error of approximately 0.243, suggesting

that the benchmark constructed using table lookup may be approximated by


other machine learning algorithms that scale to problems with more complex

representations.

A second direction of concern regards whether the estimated ratio is special to

the problem of prediction of eight-length strings of coin flips. This may be the case

if, for example, table lookup succeeds by capturing specific features of generation

of eight-length H/T strings that do not generalize to related problems. We thus

examine the stability of estimated completeness when we change the prediction

task to predicting data collected in related but non-identical contexts. Specifically,

we learn a model for human generation of randomness using the original data of

eight-length coin flips, and then use this model to predict strings generated in a

nearby domain.

We consider two variations in the domain. In Section 3.1, we relabel the possible

outcomes: instead of asking subjects to generate binary strings generated as if

repeatedly flipping a fair coin labelled “Heads" and “Tails", we impose new frames

in which the coin is either labelled “@" on one side and “!" on the other, or “r"

on one side and “2" on the other. In Section 3.2, we change the index of the flip

to be predicted: instead of asking subjects to generate eight repeated coin flips

and predicting the eighth, we ask subjects to generate fifteen coin flips and try to

predict flips 9-15. We find that in these modified prediction problems, the existing

models produce between 7% and 30% of the improvement in prediction error obtained

using table lookup, providing evidence that the benchmark and ratio discovered

previously are indeed stable across local problem domains.

Taken together, these results suggest that: (1) there is a significant amount of

structure in the problem of predicting human generation of randomness that existing

models have yet to capture and (2) machine learning may provide a generally viable

approach to testing theory completeness.


2.2 Primary setting: human generation of coin flips

2.2.1 Description of data

We asked 334 subjects on Mechanical Turk to generate 50 binary strings of length

eight, each, as if these strings were the realizations of 50 experiments in which a

fair coin was flipped 8 times. The task was described to subjects using the text

below:

We are researchers interested in how well humans can produce randomness. A coin flip, as you know, is about as random as it gets. Your job is to mimic a coin. We will ask you to generate 8 flips of a coin. You are to simply give us a sequence of Heads (H) and Tails (T) just like what we would get if we flipped a coin.

Important: We are interested in how people do at this task. So it is important to us that you not actually flip a coin or use some other randomizing device.

Entry of the coin flips was implemented through eight drop-down menus, each of

which had the options “H" and “T". Subject effort was incentivized through the

following text:

To encourage effort in this task, we have developed an algorithm (based on previous Mechanical Turkers) that detects human-generated coin flips from computer-generated coin flips. You are approved for payment only if our computer is not able to identify your flips as human-generated with high confidence.

Additionally, to discourage use of an external randomizing device, we required

subjects to complete each string in 30 seconds or less. The complete set of directions

can be found in Appendix A.

The “algorithm" that we use for detection of lazy subjects identified the 26

strings whose empirical frequency exceeded the 90th percentile across strings


(0.0059). These strings were categorized as over-generated, and subjects who produced 20 or more over-generated strings were removed from the data. In total, this

criterion removed 53 subjects, or 2900 strings. The remaining dataset consists of

a total of 281 unique subjects, or 14,050 strings. Throughout, we identify Heads with '1' and Tails with '0', so that each string is an element of {1, 0}^8.

Figure 2.1: (a) Top row. Distribution of the number of heads in the realized string. Left: comparison of MTurk data with theoretical Bernoulli predictions. Right: comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions. (b) Bottom row. Distribution of the proportion of runs which are of length m. Left: comparison of MTurk data with theoretical Bernoulli predictions. Right: comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions.

We find that the observed distribution over strings is unlikely to have been

generated by a true Bernoulli(0.5) process: the hypothesis that the true distribution

over {1, 0}^8 is uniform is rejected under a χ² test with p ≈ 0. Moreover, the nature

of mis-generation is qualitatively consistent with comparative references (we use

Nickerson and Butler (2009) and Rapaport and Budescu (1997)). For example,

there is an over-tendency towards alternation (52.16% of flips are different from

the previous flip, as compared to an expected 50% in a Bernoulli(0.5) process), an


under-tendency to generate strings with “extreme" ratios of Heads to Tails (see the

top row of Figure 2.1), and an under-tendency to generate strings with long runs

(see the bottom row of Figure 2.1).
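The alternation statistic just mentioned is easy to compute directly from the 0/1 strings; a minimal sketch (the function name and the list-of-lists representation are our own choices):

```python
def alternation_rate(strings):
    """Fraction of flips (after the first in each string) that differ
    from the immediately preceding flip. A true Bernoulli(0.5) process
    gives 0.5 in expectation; the MTurk data shows about 0.5216."""
    changes = total = 0
    for s in strings:
        for k in range(1, len(s)):
            changes += s[k] != s[k - 1]
            total += 1
    return changes / total
```

Applied to the 14,050 collected strings, this statistic would reproduce the 52.16% figure cited above.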

Additionally, subjects display strong context-dependency: the probability of

reversal depends on several previous flips. Table 2.1 compares this dependence

in our data with statistics reported in Rabin and Vayanos (2010) (using data from

Rapaport and Budescu (1997)), listing the respective probabilities that various

three-flip patterns are followed by '1'. Except for a much softer contrast in our data between the probability with which '000' and '111' are followed by '1', we find that

these conditional probabilities are quite similar.

Pattern   Our data   Rapaport and Budescu (1997)   Bernoulli
0 1 0     0.5995     0.588                         0.5
1 0 0     0.5406     0.62                          0.5
0 0 1     0.5189     0.513                         0.5
0 0 0     0.5185     0.70                          0.5
1 1 1     0.4811     0.30                          0.5
0 1 1     0.4595     0.38                          0.5
1 1 0     0.4528     0.487                         0.5
1 0 1     0.4415     0.412                         0.5

Table 2.1: The empirical probability of Heads, conditional on three fixed previous flips: (1) the actual proportion of generated Heads in our data, (2) the assessed probability of Heads on the next flip from Rapaport and Budescu (1997), as presented in Rabin and Vayanos (2010), (3) probabilities consistent with a Bernoulli(0.5) process.

2.2.2 Theories of misperception

Motivated by empirical findings such as those described above, several frameworks

have been proposed for modeling human misperception of randomness. We

consider in particular the two approaches proposed in Rabin (2002) and Rabin and

Vayanos (2010).


Rabin (2002) models subjects who observe i.i.d. signals, but mistakenly believe

them to be negatively autocorrelated. Specifically, subjects observe a sequence of

i.i.d. draws from a Bernoulli(θ) distribution, where θ ∈ [0, 1] is an unknown rate drawn from a distribution π. Although subjects know the correct distribution π, they have a mistaken belief about the way in which the realized rate θ determines the signal process. Subjects believe that the observed signals are drawn without replacement from an urn containing θN '1' signals and (1 − θ)N '0' signals, so

that a signal of ‘1’ is less likely following observation of ‘0’, and vice versa. For

convenience, the author imposes an additional, stylized, assumption in which

subjects believe that the urn is “refreshed" every other round, meaning that the

composition is returned to θN '1' signals and (1 − θ)N '0' signals.

This model is primarily intended as a model of mistaken inference, and not

directly as a model of generation of random sequences, so a few adaptations are

needed to carry this model into our setting. We alter the model in the following

ways: First, since subjects are told the bias of the coin (fair), we fix the distribution π over rates so that subjects know θ = 0.5 with certainty. Second, we relax the

assumption that the urn is refreshed deterministically every other round, adding

in a second parameter p ∈ [0, 1], which determines the probability that the urn is

refreshed. In this revised model, subjects generate random sequences by drawing

without replacement from an urn that is initially composed of 0.5N ‘1’ balls and

0.5N ‘0’ balls, and is subsequently refreshed with probability p before every draw.

The model proposed in Rabin and Vayanos (2010) is similar in spirit, but richer.

We use the following version of their model, which is closest to our setting. Each

subject generates the first flip, s1, according to a true Bernoulli(0.5) distribution.


Then, each subsequent flip, s_k, is determined according to

s_k ∼ Ber( 0.5 − α ∑_{j=0}^{k−2} δ^j (2 s_{k−1−j} − 1) ),

where the constant δ ∈ [0, 1] captures a (decaying) influence of past flips, and the constant α ≥ 0 measures the strength of negative autocorrelation. Notice that 2s − 1 = 1 if s = 1 and 2s − 1 = −1 if s = 0, so that past instances of '1'

reduce the probability that the k-th flip is ‘1’, and past instances of ‘0’ increase this

probability.
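A sketch of this updating rule in code: the most recent flip receives weight δ^0 = 1, with influence decaying geometrically going back in time. The function name and the clipping to [0, 1] (a safeguard against extreme parameter values) are our own:

```python
def rv_prob_one(history, alpha, delta):
    """Probability that the next flip is '1' in the Rabin-Vayanos
    style model: each past '1' pushes the probability below 0.5 and
    each past '0' pushes it above, with influence decaying by a
    factor of delta per step into the past."""
    drift = sum(delta ** j * (2 * history[-1 - j] - 1)
                for j in range(len(history)))
    return min(1.0, max(0.0, 0.5 - alpha * drift))
```

For example, after a single '1' with α = 0.2, the next flip is '1' with probability 0.3; after '0 0' with α = 0.2 and δ = 0.5, the probability rises to 0.8.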

2.2.3 Prediction tasks

We test these theories by looking at how well they predict features of our data. We

focus on two tests in particular. In the first test, which we refer to as continuation,

we try to guess a subject’s eighth flip from his first seven flips. A prediction rule

for this problem is any map

f : {0, 1}^7 → [0, 1]

that takes 7-length strings into the probability that the eighth flip is ‘1’. The error

in predicting a dataset {s^i}_{i=1}^n of n strings is measured using mean-squared error:

L(f) = (1/n) ∑_{i=1}^{n} ( s^i_8 − f(s^i_{1:7}) )^2.
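In code, with each string represented as a list of 0/1 values (a representation of our own choosing), this loss can be evaluated as:

```python
def continuation_loss(strings, f):
    """Mean-squared error L(f) of a continuation rule f, which maps a
    7-flip prefix to the probability that the eighth flip is '1'."""
    return sum((s[7] - f(s[:7])) ** 2 for s in strings) / len(strings)
```

Since (s − 0.5)² = 0.25 whether s is 0 or 1, the constant rule f ≡ 0.5 earns exactly the naive baseline of 0.25 on any dataset.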

To gain some intuition on the possible values of L( f ), let us briefly note the

following.

Fact 1 Suppose that strings in {s^i}_{i=1}^n are generated i.i.d. from a Bernoulli(0.5) distribution. Then, E[L(f)] ≥ 0.25 for every possible prediction rule f.

That is, if subjects are truly generating strings according to a Bernoulli(0.5) process,


then we cannot improve on an expected mean-squared error of 0.25. If, instead, sub-

jects are generating strings according to either of the behavioral models described

above, then we can do better by leveraging the first seven flips. Optimal prediction

rules (in the sense of minimizing expected mean-squared loss) are defined for each

of these models in Appendix B, and denoted f_R and f_RV, respectively.

In the second test, which we refer to as classification, we are presented with

a dataset of strings—half generated by human subjects, and half generated by a

Bernoulli(0.5) process—and seek to separate the human-generated strings from the

computer-generated strings. A prediction rule in this problem is any map

c : {0, 1}^8 → [0, 1]

from eight-length strings into a probability that the string was generated by a

human subject. The error in predicting a dataset {s^i}_{i=1}^n of n strings is measured using mean-squared error

L(c) = (1/n) ∑_{i=1}^{n} ( c^i − c(s^i) )^2,

where c^i = 1 if the true source of generation for string s^i was a human subject, and c^i = 0 otherwise.2 In an abuse of notation, we use L to refer to the loss function

in both problems, trusting that no confusion will arise. As above, we have the

following.

Fact 2 Suppose that strings in {s^i}_{i=1}^n are generated i.i.d. from a Bernoulli(0.5) distribution. Then, E[L(c)] ≥ 0.25 for every possible prediction rule c.

2A brief comment on the relationship between these two prediction tasks. One may wonder whether success on one implies success on the other. This need not be so. Observe that success on the continuation task is achieved by correctly assessing the likelihood of s_{1:7}1 versus s_{1:7}0. Even if this ratio is close to the true ratio for every s_{1:7}, the unconditional probability of either string s_{1:7}1 or s_{1:7}0 occurring can be very off.


Thus, if strings are truly generated according to a Bernoulli(0.5) process, then we

cannot improve on an expected prediction error of 0.25. If, instead, the strings

are generated according to either behavioral model above, then we can improve

upon this error. We define the optimal prediction rules (in the sense of minimizing

expected mean-squared loss) for these models in Appendix B, and refer to them

(respectively) as c_R and c_RV.

In what follows, we test prediction rules f_R and f_RV in the continuation task on the Mechanical Turk data, and prediction rules c_R and c_RV in the classification task, given a

merged dataset consisting of the Mechanical Turk data and an equal number of

strings generated according to a Bernoulli(0.5) process. The reported prediction

error is obtained using ten-fold cross validation: we (randomly) partition the data

into 10 equally-sized subsets, estimate the free parameters of the model on nine

subsets (the training set), and predict the strings in the tenth (test set). The reported

error is an average across choices of test set.
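The cross-validation loop itself is simple to sketch; the fit/loss interface below is our own framing (fit(train) returns a prediction rule, and loss(test, rule) returns its error on the held-out fold):

```python
import random

def ten_fold_cv(data, fit, loss, seed=0):
    """Ten-fold cross-validation as described above: shuffle the data,
    split it into ten folds, fit on nine folds, evaluate on the
    held-out tenth, and average the ten errors."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::10] for i in range(10)]
    errors = []
    for i in range(10):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(loss(folds[i], fit(train)))
    return sum(errors) / len(errors)
```

For instance, a fit that ignores its training data and always guesses 0.5 earns a cross-validated error of exactly 0.25 under the continuation loss.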

                           Continuation   Classification
Guessing 50-50             0.25           0.25
Rabin (2002)               0.2495         0.2493
                           (0.0001)       (0.0001)
Rabin and Vayanos (2010)   0.2491         0.2495
                           (0.0001)       (0.0001)

Table 2.2: Prediction errors achieved using Rabin (2002) and Rabin and Vayanos (2010) are improvements on the prediction error achieved by guessing at random. How do we assess the size of this improvement?

In Table 2.2, we compare the obtained prediction errors with a naive baseline in

which we predict ‘1’ with probability 0.5 in the continuation task, and classify each

string as human generated with probability 0.5 in the classification task. The core

motivation for this paper is clearly seen here: Both behavioral models are more

predictive than the naive baseline, but the improvement on the naive baseline of


(up to) 0.0009 is extremely difficult to interpret. How much have these models

improved upon the naive prediction rule, and how much could we further hope to

improve upon it? To answer these questions, we need a benchmark for obtainable

prediction error that is much more suitable than 0.

2.2.4 Establishing a benchmark

As we discussed in the introduction, our proposed benchmark is the prediction

error achieved using a technique we refer to as table lookup.

Definition 7 (Table Lookup) Let g be the empirical distribution over strings in the

training data. The table lookup continuation rule is

f_T(s_{1:7}) = g(s_{1:7}1) / g(s_{1:7})   for all s_{1:7} ∈ {1, 0}^7,   (2.1)

where 's_{1:7}1' is the concatenation of string s_{1:7} and '1'. The table lookup classification rule is c_T(s) = g(s).

In the continuation task, the table lookup prediction rule assigns to every string

s_{1:7} ∈ {1, 0}^7 the empirical frequency with which the string is followed by '1' in

the training data. In the classification task, the table lookup prediction rule assigns

to every string its empirical frequency in the training data. Under the assumption

that strings are i.i.d. across subjects, the prediction error achieved using this rule

approaches the “lowest possible" prediction error with sufficient training data. This

is equivalent to the irreducible error in the problem, or the Bayes error rate.
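A sketch of the table-lookup continuation rule in Definition 7; the dict-of-counts representation and the 0.5 fallback for prefixes absent from the training data are our own choices:

```python
from collections import Counter

def fit_table_lookup(train):
    """Fit f_T: for each 7-flip prefix, return the empirical frequency
    with which it is followed by '1' in the training data."""
    seen = Counter(tuple(s[:7]) for s in train)
    seen_then_one = Counter(tuple(s[:7]) for s in train if s[7] == 1)
    def f_T(prefix):
        key = tuple(prefix)
        if seen[key] == 0:
            return 0.5   # unseen prefix: fall back to the naive guess
        return seen_then_one[key] / seen[key]
    return f_T
```

Fitting this rule on nine folds and evaluating it on the tenth, as in the cross-validation procedure above, is what produces the table-lookup benchmark.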

Table 2.3 compares the prediction error achieved using the behavioral models

above with the prediction error achieved using table lookup. As before, prediction

error is evaluated using 10-fold cross validation.

We find that table lookup achieves a prediction error of 0.2425 in the continuation task and 0.2430 in the classification task. These errors are far from 0, so


this is a nontrivial modification from a benchmark of no error. A simple measure

of “completeness" of the existing theories is the ratio of improvement in prediction error achieved by the best behavioral model (over the naive approach) to the

improvement in prediction error achieved by table lookup. For example, in the

continuation task,

ratio of improvement = (0.25 − min(0.2495, 0.2491)) / (0.25 − 0.2425) = 0.1233

and in the classification task,

ratio of improvement = (0.25 − min(0.2494, 0.2495)) / (0.25 − 0.2430) = 0.0857.
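The same arithmetic in code (the function name is ours). Note that the continuation figure of 0.1233 evidently uses unrounded prediction errors, since the rounded values give 0.0009/0.0075 = 0.12:

```python
def completeness(naive, model, benchmark):
    """Ratio of the model's improvement over the naive baseline to the
    benchmark's improvement over the same baseline."""
    return (naive - model) / (naive - benchmark)
```

With the rounded errors reported here, the classification ratio comes out to (0.25 − 0.2494)/(0.25 − 0.2430) ≈ 0.0857.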

Taking table lookup as our benchmark for obtainable prediction error, these

results suggest that existing behavioral models produce between 9% and 12% of the

achievable improvement in prediction error.

                                       Continuation   Classification
Bernoulli                              0.25           0.25
Rabin (2002)                           0.2495         0.2494
                                       (0.0001)       (0.0001)
Rabin and Vayanos (2010)               0.2491         0.2495
                                       (0.0001)       (0.0001)
Table Lookup                           0.2425         0.2430
                                       (0.0001)       (0.0001)
Completeness using TL as a benchmark   0.1233         0.0857

Table 2.3: Comparison of prediction error achieved using behavioral models with prediction error achieved using table lookup. The behavioral models explain between 9% and 12% of the explainable variation in the data.

2.2.5 Other possible benchmarks

Table lookup is feasible in this problem because of the size of the domain space (2^7 unique strings in the continuation task, and 2^8 unique strings in the classification


task). We can in fact search the space of predictive models to optimality, an

approach that will not be feasible in every problem. Can we use other (practically

implementable) machine learning algorithms as surrogate benchmarks in these

other cases?

In this section, we consider use of two alternative benchmarks. First, we use

LASSO regression to select a prediction rule that depends on only a small set of covariates. The features we consider are:

• the empirical frequency of alternations (the probability that flip s_k is followed

by its opposite, averaged across all k)

• indicators for the existence of runs of length 2, 3, . . . , 8 in the string

• the total number of ‘1’s in the string

• the index for the first occurrence of ‘1’ in the string

and their interactions up to degree 3. This yields a total of 176 features (including

the intercept). Let x(s) ∈ R^176 denote the feature vector corresponding to string s, and define prediction rules

\[
f_\beta(s_{1:7}) = x(s_{1:7})^{\mathsf{T}} \beta \quad \text{for all } s_{1:7} \in \{1, 0\}^7,
\]

\[
c_\beta(s) = x(s)^{\mathsf{T}} \beta \quad \text{for all } s \in \{1, 0\}^8.
\]

Then, in the continuation problem, the LASSO coefficient vector β solves

\[
\arg\min_{\beta \in \mathbb{R}^{176}} \sum_{i=1}^{N} L(f_\beta) + \lambda \|\beta\|_1,
\]

where ‖β‖_1 denotes the ℓ1 norm of the vector β (the sum of the absolute values of its components). In the classification problem, the LASSO coefficient vector β


solves

\[
\arg\min_{\beta \in \mathbb{R}^{176}} \sum_{i=1}^{N} L(c_\beta) + \lambda \|\beta\|_1.
\]
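For concreteness, the feature map x(s) can be sketched in pure Python as below (not from the dissertation). Two conventions are assumptions, since the text does not pin them down: the run indicators are read as "a run of length at least L exists", and the first-occurrence index of a string with no '1' is set to the string length. With an intercept, 10 base features, and all products of 2 and 3 distinct base features, the count is 1 + 10 + 45 + 120 = 176, matching the text:

```python
from itertools import combinations

def base_features(s):
    """Ten base features for a binary string s (a sequence of 0s and 1s)."""
    n = len(s)
    # Empirical frequency of alternations: fraction of adjacent flips that differ.
    alternation = sum(s[k] != s[k + 1] for k in range(n - 1)) / (n - 1)
    # Length of the longest run of identical flips.
    longest, run = 1, 1
    for k in range(1, n):
        run = run + 1 if s[k] == s[k - 1] else 1
        longest = max(longest, run)
    # Indicators for the existence of a run of length (at least) L, for L = 2..8.
    run_indicators = [1.0 if longest >= L else 0.0 for L in range(2, 9)]
    ones = float(sum(s))                                   # total number of '1's
    first_one = float(s.index(1)) if 1 in s else float(n)  # assumed convention
    return [alternation] + run_indicators + [ones, first_one]

def feature_vector(s):
    """Intercept plus all products of up to three distinct base features:
    1 + 10 + C(10,2) + C(10,3) = 176 features in total."""
    z = base_features(s)
    x = [1.0]  # intercept
    for d in (1, 2, 3):
        for idx in combinations(range(len(z)), d):
            prod = 1.0
            for i in idx:
                prod *= z[i]
            x.append(prod)
    return x
```

The same map applies to both tasks, since a length-7 history and a length-8 string each yield 10 base features.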

Second, we implement a decision tree algorithm using the same feature set.3 We

consider the class of decision trees in which each node considers a single feature,

and each branch corresponds to a different value (or set of values) for the feature.

New inputs are classified by proceeding down the decision tree to a terminal node,

which determines the class.

Table 2.4 below summarizes the prediction errors achieved using LASSO and

decision trees in both problems, and reports also the measure of completeness,

using these alternative prediction errors as benchmarks.

                                                  Continuation   Classification
Table lookup                                      0.2425         0.2430
Completeness using TL as a benchmark              0.1233         0.0857
LASSO                                             0.2460         0.2484
Completeness using LASSO as a benchmark           0.225          0.375
Decision tree                                     0.2419         0.2434
Completeness using decision trees as a benchmark  0.1111         0.0909

Table 2.4: Comparison of prediction error achieved using behavioral models with prediction error achieved using table lookup, LASSO, and decision trees.

The LASSO prediction rules yield a (tenfold cross-validated) prediction error

of 0.2460 in the continuation problem and 0.2484 in the classification problem.

Decision trees yield a (tenfold cross-validated) prediction error of 0.2419 in the

continuation problem and 0.2434 in the classification problem.4 Additionally, we

find that the estimated measure of completeness varies from 11% to 23% in the

3Decision trees are a recursive partitioning of the domain space. Formally, a decision tree is a rooted tree, in which each node splits the domain space into two or more subspaces according to a discrete function of the input values.

4The slight improvement of decision trees upon table lookup should be interpreted as noise due to the finite size of our training data.


continuation task, and from 9% to 37% in the classification task, depending on the

choice of algorithm. These results suggest that the benchmark constructed using

table lookup may be approximated by other machine learning algorithms that scale

to problems with more complex representations.

2.3 Transfer across Domains

The previous sections established a benchmark error of roughly 0.24, and discov-

ered that existing behavioral models achieve approximately 10% of the possible

improvement in prediction error (above a naive baseline). How special are these

results to the particular setting that we considered? Should we interpret this

benchmark and measure of completeness as pertaining only to human generation

of eight-length H/T strings, or do they express a more general truth about the

predictability of human generation of random sequences, and the extent to which

our existing models have attained this?

One reason to be concerned is the possibility that behavioral models capture

fundamental aspects of human generation of random sequences (not special to

production of fair coin flips), while table lookup relies on specific features of

generation of eight-length H/T strings that do not generalize to related problems.

For example, 57% of strings begin with ‘1’. The prediction rules derived from Rabin

(2002) and Rabin and Vayanos (2010) don’t leverage this feature for prediction, but

table lookup does. If the predictive accuracy achieved by table lookup is due in

large part to use of features like this, then we should not expect its performance to

generalize.

To address this question, we consider two robustness checks of “transfer pre-

diction" across different framings of the generation problem. Our approach is as

follows. We first estimate the free parameters of all the models (table lookup, and


the two behavioral models) on the original dataset of eight-length H/T strings.

Then, we use the estimated models to predict new strings, which are not only out-

of-sample (not used in the estimation of the free parameters), but also produced in a

modified problem domain (the generation problem is framed differently). By looking

at how well the estimated models predict in this new environment, we can assess

whether the features used in table lookup are stable features of misperception

across these various domains, or whether they are specific to the original problem.

We consider two variations in the domain. In Section 3.1, we relabel the possible

outcomes: instead of asking subjects to generate binary strings generated as if

repeatedly flipping a fair coin labelled “Heads" and “Tails", we impose new frames

in which the coin is either labelled “@" on one side and “!" on the other, or “r"

on one side and “2" on the other. In Section 3.2, we change the index of the flip

to be predicted: instead of asking subjects to generate eight repeated coin flips

and predicting the eighth, we ask subjects to generate fifteen coin flips and try to

predict flips 9-15. We find that in these modified prediction problems, the existing

models produce between 7-30% of the improvement in prediction error obtained

using table lookup, providing evidence that the benchmark and ratio discovered

previously are indeed stable across local problem domains.

2.3.1 Prediction of New Alphabets

In the first transfer prediction task, we attempt to predict strings under a relabelling

of the outcome space from {Heads, Tails} to {r, 2}, and to {@, !}. Specifically, we

ask 124 subjects on Mechanical Turk to generate 50 binary strings of length eight

“as if these strings were the realizations of 50 experiments in which a fair coin

labeled ‘r’ on one side and ‘2’ on another was flipped 8 times." We ask another 114

subjects to generate 50 binary strings of length eight “as if these strings were the


realizations of 50 experiments in which a fair coin labeled ‘@’ on one side and ‘!’

on another was flipped 8 times".

Following the procedure outlined in Section 2.2, we determine which strings

have an empirical frequency exceeding the 90th percentile, and remove all subjects

who produced 20 or more such strings. We also identify strings with elements

of {1, 0}, mapping ‘r’ and ‘@’ into ‘1’, and ‘2’ and ‘!’ into ‘0.’ We refer to these

datasets of strings respectively as Dr2 and D@!, and the original data as DHT.
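The relabelling and screening steps can be sketched as follows (not from the dissertation; the exact percentile convention used in the original screen is not specified, so the one below, taken over the frequencies of the distinct pooled strings, is an assumption):

```python
from collections import Counter

def relabel(strings, one_symbol):
    """Map two-symbol strings into {1, 0} strings, sending one_symbol to '1'."""
    return ["".join("1" if ch == one_symbol else "0" for ch in s) for s in strings]

def screen_subjects(subjects, threshold=20, pct=0.90):
    """Drop subjects who produced `threshold` or more strings whose empirical
    frequency in the pooled sample exceeds the `pct` percentile (assumed convention)."""
    pooled = Counter(s for strings in subjects for s in strings)
    freqs = sorted(pooled.values())
    cutoff = freqs[int(pct * (len(freqs) - 1))]
    common = {s for s, c in pooled.items() if c > cutoff}
    return [strings for strings in subjects
            if sum(s in common for s in strings) < threshold]
```

A subject who repeats one very common string, for example, is dropped, while a subject producing varied strings is retained.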

The prediction problems we examine are the following. First: Suppose we know

the first seven flips that a subject generated in Dr2 (D@!). How well can we predict

his eighth flip, using only the strings in DHT to train our prediction model? This is

the transfer analogue of the continuation task described in Section 2.4.

Second: Suppose we have a dataset combining the strings in Dr2 (D@!) with an

equal number of strings generated by a Bernoulli(0.5) process. How well can we

separate the human-generated strings from the computer-generated strings, using

only the strings in DHT to train our prediction model? This is the transfer analogue

of the classification task described in Section 2.4.

To answer these questions, we estimate the free parameters in Rabin (2002),

Rabin and Vayanos (2010), and table lookup using DHT, and then use the estimated

models to predict strings in Dr2 and D@!. The resulting prediction errors (reported

below as ten-fold cross validated errors) are listed in Table 2.5, as well as a measure

of completeness, using the table lookup prediction error as a benchmark.

The most important components of Table 2.5 are the following. First, we
find that the benchmarks of 0.2425 and 0.2430 discovered previously are remarkably

robust across the different framings: table lookup achieves prediction errors ranging

from 0.2431 to 0.2456. The measure of completeness ranges between 7% and 18%, and is comparable to the range of 9-12% found previously.


                                         Continuation             Classification
                                         {r, 2}      {@, !}       {r, 2}      {@, !}
Guessing 50-50                           0.25        0.25         0.25        0.25
Rabin (2002)                             0.2493      0.2499       0.2499      0.2493
                                         (0.0001)    (0.0001)     (0.0001)    (0.0001)
Rabin and Vayanos (2010)                 0.2491      0.2497       0.2491      0.2501
                                         (0.0001)    (0.0001)     (0.0001)    (0.0001)
Table Lookup                             0.2451      0.2456       0.2431      0.2434
                                         (0.0001)    (0.0001)     (0.0001)    (0.0001)
Completeness using TL as a benchmark     0.1836      0.0682       0.1304      0.1061

Table 2.5: We train table lookup and our two behavioral models on the original 8-length {H, T} data, and then use the estimated models to predict 8-length {r, 2} and {@, !} data. Reported prediction errors are tenfold cross-validated mean squared errors.

2.3.2 Prediction of Subsequent Flips

In the second transfer prediction task, we use the original eight-length strings

to predict strings of length fifteen. The data to be predicted was produced by

asking 120 subjects on Mechanical Turk to generate 25 binary strings of length

fifteen “as if these strings were the realizations of 25 experiments in which a fair

coin was flipped 15 times". From this data, we construct seven “ghost” datasets of
eight-length strings, each including only flips k through k + 7, where k ∈ {2, . . . , 8}.

Following the procedure outlined in Section 2.2, we identify the strings whose

empirical frequency exceeded the 90th percentile, and remove all subjects who

produced 20 or more such strings. We again identify strings with elements of

{1, 0}, mapping ‘H’ into ‘1’, and ‘T’ into ‘0.’ The datasets are labeled Dk:k+7, with

k = 2, . . . , 8.
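The window construction is a one-liner; a minimal sketch (flips are 1-indexed in the text, so flips k through k + 7 correspond to the Python slice k-1 : k+7):

```python
def ghost_datasets(strings15):
    """The seven eight-flip "ghost" windows D_{k:k+7}, k = 2, ..., 8."""
    return {k: [s[k - 1:k + 7] for s in strings15] for k in range(2, 9)}
```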

The prediction problems we examine are the following. First: Suppose we know

the first seven flips that a subject generated in Dk:k+7. How well can we predict the

final flip, using only the strings in DHT to train our prediction model? This is the

transfer analogue of the continuation task described in Section 2.4.


Second: Suppose we have a dataset combining the strings in Dk:k+7 with an

equal number of strings generated by a Bernoulli(0.5) process. How well can we

separate these strings into those that are human-generated and computer-generated,

using only the strings in DHT to train our prediction model? This is the transfer

analogue of the classification task described in Section 2.4.

To answer these questions, we estimate the free parameters in Rabin (2002),

Rabin and Vayanos (2010), and table lookup using DHT, and then use the estimated

models to predict strings in Dk:k+7, k = 2, . . . , 8. Table 2.6 reports prediction errors

obtained using table lookup, and the two behavioral models. We show two results:

first, the prediction error obtained in dataset D8:15 alone; second, the prediction

error averaged across the seven datasets Dk:k+7, k = 2, . . . , 8.

                                         Continuation             Classification
                                         Last flip   Average      Last flip   Average
Guessing 50-50                           0.25        0.25         0.25        0.25
Rabin (2002)                             0.2500      0.2484       0.2474      0.2479
                                         (0.0001)    (0.0001)     (0.0001)    (0.0001)
Rabin and Vayanos (2010)                 0.2462      0.2466       0.2484      0.2485
                                         (0.0001)    (0.0001)     (0.0001)    (0.0001)
Table Lookup                             0.2369      0.2391       0.2409      0.2421
                                         (0.0001)    (0.0001)     (0.0001)    (0.0001)
Completeness using TL as a benchmark     0.2900      0.3098       0.2857      0.2772

Table 2.6: We train table lookup and our two behavioral models on the original 8-length {H, T} data, and then use the estimated models to predict the data in \(\{D_{k:k+7}\}_{k=2}^{8}\). Reported prediction errors are tenfold cross-validated mean squared errors.

We find that the table lookup prediction error ranges from 0.2369 to 0.2421,

which is comparable to the range from 0.2425 to 0.2430 discovered earlier, and that

the measure of completeness ranges from 27% to 31%, greater than but comparable to the previous range of 9-12%.


2.4 Discussion

2.4.1 Guarantees on the benchmark

The problem analyzed in this paper can be reformulated as follows. Let X =
{X_k}_{k≥1} be the “human-generated” {1, 0}-valued random process. If the condi-

tional distribution of X8 given the realizations of X1, . . . , X7 is non-degenerate, then

the 8th flip cannot be predicted perfectly from the first seven (and likewise, realiza-

tions of X cannot be perfectly separated from realizations of a true Bernoulli(0.5)

sequence). We want to compare the prediction error achieved using existing models

not with 0, but with the irreducible error

\[
\mathbb{E}\big(X_8 - f^*(X_1, \ldots, X_7)\big)^2 \tag{2.2}
\]

where the expectation is with respect to the distribution over {1, 0}^8 induced by {X_1, . . . , X_8}, and f^* is defined by

\[
f^*(x_1, \ldots, x_7) = \Pr(X_8 = 1 \mid X_1 = x_1, \ldots, X_7 = x_7).
\]

This is the error that would be achieved by using the true model X to predict

realizations of X_8, known in the literature as the Bayes risk. Notice that no function
f : {0, 1}^7 → [0, 1] can improve upon f^* in an expected mean-squared-error sense.

The prediction error obtained using table lookup is a consistent estimator of
the Bayes risk: as the quantity of training data approaches infinity, the prediction
error achieved using f_TL approximates (2.2) to arbitrary precision. However, this
approach, which relies on nonparametric estimation of each Pr(1 | x_1, . . . , x_7) for
every (x_1, . . . , x_7) ∈ {1, 0}^7, is not feasible in problems where the domain space

is substantially larger. In these more general settings, approaches such as those

considered in Section 2.2.5 (LASSO regression and decision trees) are more viable


alternatives. Choice of which algorithm is used to generate the benchmark should

rely on domain knowledge regarding the assumptions the process is likely to

satisfy. We refer the reader to a large literature on estimation of Bayes risk for

theoretical guarantees of different approaches.
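To illustrate the consistency claim, here is a small self-contained simulation (not from the dissertation): flips come from a hypothetical over-alternating Markov process with alternation probability 0.6, so Pr(X_8 = 1 | history) is 0.6 or 0.4 depending only on the seventh flip, the Bayes risk under squared error is exactly 0.4 × 0.6 = 0.24, and the table-lookup estimate approaches it as the training sample grows:

```python
import random

ALT = 0.6  # assumed alternation probability of the hypothetical "human" process

def draw_string(rng, n=8):
    """One string: each flip equals the opposite of the previous flip with
    probability ALT, so realizations over-alternate."""
    s = [1 if rng.random() < 0.5 else 0]
    for _ in range(n - 1):
        s.append(1 - s[-1] if rng.random() < ALT else s[-1])
    return tuple(s)

def table_lookup(train):
    """Nonparametric estimate of Pr(X8 = 1 | x1..x7): per-history averaging."""
    counts = {}
    for s in train:
        ones, total = counts.get(s[:7], (0, 0))
        counts[s[:7]] = (ones + s[7], total + 1)
    return {h: ones / total for h, (ones, total) in counts.items()}

def mse(table, test):
    return sum((s[7] - table.get(s[:7], 0.5)) ** 2 for s in test) / len(test)

rng = random.Random(0)
train = [draw_string(rng) for _ in range(50_000)]
test = [draw_string(rng) for _ in range(20_000)]

# Conditional variance is p(1-p) for every history, so the Bayes risk is 0.24.
bayes_risk = ALT * (1 - ALT)
err = mse(table_lookup(train), test)
```

With 50,000 training strings spread over only 2^7 = 128 histories, each conditional probability is estimated from roughly 400 observations, so the hold-out error lands very close to the Bayes risk.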

2.4.2 Covariates

Throughout this paper, we have considered a fixed set of explanatory covariates,

namely the initial seven flips in the first prediction task, and the string itself in the

second. The notion of completeness we have proposed is more precisely stated as

a measure of completeness for this given set of covariates.

The proposed measure of completeness, therefore, is not instructive towards

which additional covariates (beyond initial flips) might be added to improve

prediction, or how much prediction accuracy can be improved by adding these

covariates. What it does allow us to do is separate two potential reasons for

imperfect prediction: low predictive accuracy because we haven’t yet identified

the most predictive covariates, and low predictive accuracy because our model is

using good covariates in a poor way. For example, suppose we find that benchmark

accuracy is low (say, 55%). This indicates that we may gain from investigating additional possible covariates, rather than trying to improve the current classifier

without changing the feature set. If instead, benchmark accuracy is high (say, 80%),

while our achieved accuracy is low (say, 55%), then there are large gains potentially

to be had in prediction, without the addition of any new features.

2.4.3 Transfer learning

The core of Section 3 is a question of transfer learning: how much can we

improve prediction of strings generated in a given domain, using knowledge of


how strings are generated in a related domain? We focused on transfer learning

across human-generated Bernoulli(0.5) sequences with different outcome spaces

and string lengths, and found that prediction of strings generated for a given

outcome space and string length could be substantially improved using data on

strings generated for a different outcome space and a different string length. This

is reassuring: it suggests that transfer learning is possible across close problems.

But a fuller understanding of the generalizability of machine learning ap-

proaches will require investigation into the extent to which transfer learning is

possible. For example, let us consider a larger class of random processes, in which

the probability of ‘1’ varies, or the size of the outcome space varies. Can we

transfer learn across these more substantive changes in the domain? Concretely:

might we use human-generated Bernoulli(0.5) strings to predict human-generated

Bernoulli(0.6) strings, or human-generated Bernoulli(0.5) strings to predict human-

generated Normal(0,1) strings? We leave this large body of questions for future

work.

2.5 Relationship to Literature

The closest paper to ours is Peysakhovich and Naecker (2016), which independently

proposes the use of machine learning algorithms to provide a benchmark on

obtainable predictive accuracy. The authors focus in this paper on the domain

of choice under uncertainty, assessing the ability of classical models to predict

choices over (risky and ambiguous) lotteries. They use regularized regression as a

benchmark, and find that classical models explain a larger fraction of achievable

prediction error in the domain of risk than in the domain of ambiguity.

The question considered in this paper is also distantly related to classical work

concerning the learnability of a random process. For example, Jackson et al. (1999)


derive a Bayesian representation for a stochastic process, with the property that

component distributions are fine enough to be “sufficient for prediction," but coarse

enough so that the components are “learnable." These questions consider asymptotic

properties of the random process, whereas we focus on learning finite properties

of the random process (e.g. the eighth flip of the coin).

Finally, this paper is related to the extensive experimental (Bar-Hillel and

Wagenaar, 1991; Rapoport and Budescu, 1997; Rath, 1966; Edwards; Nickerson and
Butler, 2009; Wagenaar, 1972), empirical (Camerer, 1989; Chen et al.; Gilovich et al.,
1985; Croson and Sundali, 2005), and theoretical (Falk and Konold, 1997; Tversky

and Kahneman, 1971; Barberis et al., 1998; Rabin and Vayanos, 2010) literature on

human misperception of randomness.

2.6 Conclusion

Machine learning has produced techniques that have enabled substantial progress

in various academic disciplines. The question of how these techniques are best

leveraged for advancing research in the social sciences remains open. The purpose

of this paper is to propose one potential use of these techniques: towards assessing theory “completeness.” We illustrate this proposal on the simple problem of

predicting human generation of fair coin flips. We show that the prediction error

obtained using the algorithm “table lookup" is a useful benchmark for evaluation of

the success of existing behavioral models—and in particular, that existing models

explain up to 30% of the predictable variation in the problem. Moreover, this

benchmark is robust across related problems, suggesting that machine learning

may provide a generally viable approach to testing theory completeness.


Chapter 3

Interpretation of Inconsistent Choice Data: How Many Context-Dependent Preferences are There?

3.1 Introduction

In the simplest model of choice, an individual's preference is described as a linear ordering ≻ over alternatives in a set X, and his choice from any subset A ⊆ X is the ≻-maximal element in A. It is common to interpret choice data under this model, and

to infer a single ordering best fit to the data. In practice, however, choice datasets

may result from the maximization of several different preferences. For example:

1. Heterogeneity in preference across choice domains. Choice data frequently pools

observations drawn from a variety of choice domains, but agents may have


different preferences in different domains. For example, Einav et al. (2012)

study the commonality of financial risk preferences across six choice domains

— including 401(k) asset allocations, short-term disability insurance, and

insurance choices regarding health, drug, and dental expenditures — and find

that just over 30% of their sample makes decisions that can be simultaneously

rationalized over all six domains.

2. Multiple behavioral selves. There is extensive empirical evidence that preference

varies with external details of the choice environment, for example the

framing of the problem (Kahneman and Tversky, 2000), the presence of

default options (Beshears et al., 2008), and the addition of decoy options

(Huber et al., 1982). Outside of the laboratory, it is unusual for external details

to remain constant across every observed choice; choices may therefore reflect

different behavioral biases in different observations.

3. Multiple representative agents. Choice data is often pooled from a population,

across which there may be subpopulations or clusters of individuals with

different preferences. Ideally, agents in different clusters are identifiably

different, but in practice the observable characteristics of these agents may be

indistinguishable (see, for example, Crawford and Pendakur (2012)).

In each of these settings, the analyst may not know beforehand the number of

context-dependent preferences maximized in the data. Since context-dependence is

a persistent feature of preference, knowing the number of contexts is important for

two reasons: 1) Welfare implications—it tells us whether standard welfare analysis is

appropriate for interpretation of the observed choices, or whether there is genuine

preference variation that should be elicited to inform normative statements; 2)

Prediction—it tells us whether to interpret the observed inconsistency as noise, or


as systematic variation that can inform prediction of future choices.

The purpose of this paper is to provide a tool for determining the number

of context-dependent preferences maximized in the data. The challenge is that

inconsistencies may reveal genuine context-dependencies, but may also simply

reveal choice error (for example, due to inattention by the subject or measurement

error by the analyst). At extremes, we can rationalize the data using only multi-

plicity in rationales (in which case every observation in conflict is described with

a new ordering) or using only choice error (in which case every observation in

conflict is described as a mistake). How can the analyst recover the true number of

context-dependent preferences from choice data?

This paper proposes a simple approach using regularization1, a statistical

technique in which a penalty is imposed on the complexity of the learned model to

prevent overfitting. Regularization techniques have received tremendous interest

in recent decades in the applied mathematics and computer science communities

due to their ability to recover various sparse structures from noisy or otherwise

corrupted data. Applications range from recovery of signals (Donoho and Huo,

2001; Elad and Bruckstein, 2002) to repair of damaged images and videos (Ren

et al., 2012; Yang et al., 2013) to lyrics and music separation (Huang et al., 2013).2

The sparse structure in my problem is preference: while the agent may possess as

many as |X|! context-dependent orderings over X, I assume that the number of

1For example, two common regularization techniques for learning the regression function f(x) = xβ extend the usual OLS approach, min Σ_i (y_i − x_iβ)^2, to:

• min Σ_i (y_i − x_iβ)^2 + λ‖β‖_2 (ridge regularization), and

• min Σ_i (y_i − x_iβ)^2 + λ‖β‖_1 (lasso regularization).

Ridge regression penalizes the complexity of the learned model through the ℓ2 norm on the vector of coefficients, and lasso regression penalizes the ℓ1 norm.

2These ideas and techniques have recently begun to be applied to economic problems. See forexample Belloni et al. (2011), Belloni and Chernozhukov (2011), Belloni et al. (2012), and Gabaix(2014).


orderings he maximizes is much smaller.

I adapt methods from this literature to suggest an “optimal" number of order-

ings to use in describing the data. A best multiple-ordering rationalization (BMOR) is

defined as a solution to the following program:

\[
\arg\min_{R \in \mathcal{R}} \; |R| + \lambda E(D, R), \tag{3.1}
\]

where ℛ consists of all sets of orderings over X, E(D, R) is the number of observations in dataset D that are inconsistent with maximization of any ordering in R, and λ ∈ ℝ+ is a constant. The program in (3.1) thus maximizes fit (by minimizing the number of unexplained observations E(D, R)) subject to a penalty on the number of orderings used (by minimizing |R|), and the constant λ trades off between these two goals.

Notice that the problem in (3.1) nests as special cases two well-known approaches in the literature. For small values of λ, (3.1) returns the Houtman and Maks (1985) solution: the ordering that explains the largest number of observations in the data. The choice of λ ≥ 1 returns the Kalai et al. (2002) solution: the smallest set of orderings that explains all of the data. The problem in (3.1) generalizes these two approaches by considering intermediate values of λ ∈ (0, 1).

For what choices of λ, and with what guarantee, does the solution to (3.1)
recover features of the agent's preference? The main part of the paper shows that
if choice data is generated by any model in a large class, there is an interval of
choices of λ for which the approach in (3.1) exactly recovers the correct number of

orderings with probability exponentially close to 1. The class of data-generating

processes I consider is the following. Let F be a (finite) set of K contexts and
let R = {≻_f}_{f∈F} be a set of context-dependent preferences.3 A choice problem is a

3In the special case in which the set of contexts is a partition of the power set on X, R describes a


pair (A, f) consisting of a choice set A and a context f. In each choice problem,
the agent selects the f-optimal alternative in A with probability at least 1 − p;

otherwise, he trembles and selects a different alternative. I do not impose any

parametric assumptions on either the pattern of error or the nature of the orderings.

Taking K = 1 and p = 0 returns the canonical single-ordering model, and taking

K ≥ 1 and p = 0 returns the generalized choice functions proposed independently

in Salant and Rubinstein (2008) and Bernheim and Rangel (2009).

My main result (Theorem 1) shows that if the number of orderings K is suffi-

ciently small (relative to the number of possible orderings, |X|!), the probability p

of erring in each choice is sufficiently low, and the choice implications of the prefer-

ence orderings in R are sufficiently different (see Section 5.2), then the problem in

(3.1) recovers the exact number of orderings K with probability exponentially close

to 1 (in quantity of data). Section 5.3 qualifies this result: while we can recover the

number of orderings, it is not in general possible to recover the orderings. I make

preliminary comments towards extension of the approach to recover further details

of preference.

Section 6 is the literature review, and Section 7 concludes.

3.2 Notation

Let X denote the set of choice alternatives, A denote a typical subset of X, and 𝒜 denote the set of all subsets of X. I refer to A as a choice set, and 𝒜 as the set of all choice sets. A choice observation (x, A) is a pair denoting selection of alternative x from choice set A. A dataset is a collection of choice observations

\[
D = \{(x, A) \mid A \in A\},
\]

set of menu-dependent preferences.


where A is a (multi)set of elements from 𝒜.

A strict linear ordering ≻ is a complete, antisymmetric, and transitive binary relation on X. Let R be the set of all permutations of (1, 2, . . . , N), with typical element ρ. Identify every linear ordering ≻ with the permutation, or preference ordering, ρ = (r_1, r_2, . . . , r_N) ∈ R satisfying r_i < r_j if and only if x_i ≻ x_j. (For example, x_1 ≻ x_3 ≻ x_2 is identified with ρ = (1, 3, 2).) Coordinate r_i can be interpreted as the ordinal rank of alternative x_i according to the ordering ≻. The choice function c_ρ : 𝒜 → X induced by ρ takes every choice set A ∈ 𝒜 to the ρ-maximal element in A. Say that dataset D is consistent if there exists an ordering ρ such that (x, A) ∈ D only if x = c_ρ(A), and inconsistent otherwise. Two choice observations (x, A) and (x′, A′) with x ≠ x′ are said to be in violation of the Independence of Irrelevant Alternatives axiom (IIA) if x, x′ ∈ A and also x, x′ ∈ A′, so that they cannot be rationalized by the same strict ordering.

For a given set of orderings R ∈ ℛ, define

\[
C(R) := \{ (x, A) \mid A \in \mathcal{A} \text{ and } x \in \{ c_\rho(A) : \rho \in R \} \}
\]

to be the set of all choice observations consistent with maximization of some ordering in R. I refer to these as the choice implications of the set R. For example, let X = {x_1, x_2, x_3} and define orderings ρ_1 = (1, 2, 3) and ρ_2 = (1, 3, 2). Then, for R = {ρ_1, ρ_2},

\[
C(R) = \{ (x_1, \{x_1, x_2\}), (x_1, \{x_1, x_3\}), (x_2, \{x_2, x_3\}), (x_3, \{x_2, x_3\}), (x_1, \{x_1, x_2, x_3\}) \}.
\]
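The worked example can be checked mechanically; a short Python sketch (not from the dissertation; an ordering is represented as a rank tuple exactly as in the text, alternatives x1, x2, x3 are encoded 0, 1, 2, and singleton choice sets are omitted):

```python
from itertools import combinations

def best(ranks, A):
    """The rho-maximal element of choice set A, where ranks[i] is the rank of x_{i+1}."""
    return min(A, key=lambda i: ranks[i])

def choice_implications(orderings, n):
    """C(R): every observation (x, A) with x rho-maximal in A for some rho in R.
    Choice sets are all subsets of {0, ..., n-1} with at least two elements."""
    return {(best(ranks, A), A)
            for size in range(2, n + 1)
            for A in combinations(range(n), size)
            for ranks in orderings}

# The example from the text: rho_1 = (1, 2, 3) and rho_2 = (1, 3, 2) over X = {x1, x2, x3}.
C = choice_implications([(1, 2, 3), (1, 3, 2)], 3)
```

The two orderings disagree only on {x2, x3}, so C(R) contains five observations, matching the display above.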

3.3 Example

I begin by illustrating ideas on a toy choice dataset, and subsequently describe

the general approach in Section 4. Consider a set of choice alternatives X =


{x1, x2, x3, x4} and the dataset D consisting of the following 21 observations:

• Choice of x1 from every subset containing x1.

• Choice of x2 from {x2, x3}, {x2, x4}, and {x2, x3, x4}.

• Choice of x3 from {x3, x4}.

• Choice of x4 from every subset containing x4.

• Choice of x3 from {x1, x3}, {x2, x3}, and {x1, x2, x3}.

• Choice of x2 from {x1, x2}.

• Choice of x3 from {x1, x3, x4}.

There are many possible rationalizations of this data. If the analyst uses a single

ordering to explain the data, he can explain up to 10 observations, for example

with

R1 = {x_1 ≻ x_2 ≻ x_3 ≻ x_4}.

The minimal obtainable choice error using a single ordering is Δ_1 = 21 − 10 = 11.

If the analyst allows for two orderings, he can explain 20 observations, for example

with

R2 = {x_1 ≻ x_2 ≻ x_3 ≻ x_4, x_4 ≻ x_3 ≻ x_2 ≻ x_1}.

The minimal obtainable choice error using two orderings is Δ_2 = 21 − 20 = 1. Finally, if the analyst allows for three orderings, he can explain all of the

observations, for example with

R3 = {x_1 ≻ x_2 ≻ x_3 ≻ x_4, x_4 ≻ x_3 ≻ x_2 ≻ x_1, x_3 ≻ x_1 ≻ x_2 ≻ x_4}.

The minimal obtainable choice error using three orderings is Δ_3 = 0. These
observations are collected in the figure below, which graphs the minimal obtainable


choice error Δ_k for values k = 1, . . . , 5.

[Figure: minimal obtainable choice error Δ_k plotted against the number of orderings k; the error falls from 11 at k = 1 to 1 at k = 2 and 0 at k ≥ 3.]

How should the analyst choose between these solutions? Notice that the

ordering in R1 is consistent with the proposal of Houtman and Maks (1985), which

finds the largest subset of choice observations that can be explained by a single

ordering4. The set of orderings R3 is consistent with the proposal of Kalai et al.

(2002), which finds the smallest number of orderings that can perfectly explain all

observations.

This paper proposes an intermediate solution. Define the set of best multiple-

ordering rationalizations to be the solution to

argmin_{R ⊆ ℛ} |R| + λE(D, R),

choosing λ = 1/(0.1|D|) = 1/2.1, as proposed in Corollary 3. It is easy to verify that all best

multiple-ordering rationalizations consist of two orderings.5 This is because the

gain from introducing a second ordering is nearly half of the dataset (D1 − D2 = 10),

whereas the gain from permitting a third is only a single observation (D2 − D3 = 1).

4This solution is not unique. For example, x4 ≻ x3 ≻ x2 ≻ x1 also explains 10 observations.

5Moreover, this outcome holds for any choice of λ ∈ (1/10, 1).


The proposed approach thus interprets the data as reflecting maximization of two

orderings, with a single choice observation in error.
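The computations in this example can be reproduced by brute force. The sketch below is a hypothetical Python illustration: it reads "every subset containing xi" as the choice sets of two or three alternatives, an assumption which reproduces the 21 observations and the errors D1 = 11, D2 = 1, D3 = 0 reported above, then enumerates all orderings of X and selects the number of orderings minimizing k + λDk with λ = 1/2.1:

```python
from itertools import permutations, combinations

X = (1, 2, 3, 4)
# Choice sets of two or three alternatives (an assumption: this reading of
# "every subset" reproduces the 21 observations in the text).
sets = [frozenset(s) for n in (2, 3) for s in combinations(X, n)]

D = [(1, A) for A in sets if 1 in A]                         # x1 chosen
D += [(2, frozenset(A)) for A in [{2, 3}, {2, 4}, {2, 3, 4}]]
D += [(3, frozenset({3, 4}))]
D += [(4, A) for A in sets if 4 in A]                         # x4 chosen
D += [(3, frozenset(A)) for A in [{1, 3}, {2, 3}, {1, 2, 3}]]
D += [(2, frozenset({1, 2}))]
D += [(3, frozenset({1, 3, 4}))]
assert len(D) == 21

def explained(order):
    # observations whose choice is top-ranked in its set under `order`
    # (`order` lists alternatives from best to worst)
    return {(x, A) for (x, A) in D if min(A, key=order.index) == x}

coverage = [explained(o) for o in permutations(X)]

def D_k(k):
    # minimal obtainable choice error using k orderings (brute force)
    return len(D) - max(len(set().union(*R)) for R in combinations(coverage, k))

errors = {k: D_k(k) for k in (1, 2, 3)}
assert errors == {1: 11, 2: 1, 3: 0}

lam = 1 / (0.1 * len(D))                        # lambda = 1/2.1 (Corollary 3)
objective = {k: k + lam * errors[k] for k in errors}
assert min(objective, key=objective.get) == 2   # best rationalization: 2 orderings
```

The brute-force search over sets of orderings is feasible only for this toy problem; it is not the estimation procedure analyzed later in the chapter.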

3.4 Approach

Fix a dataset D. The implied choice error when using a set of orderings R to

rationalize D is defined

E(D, R) := |{(x, A) ∈ D : x ≠ c≻(A) for all ≻ ∈ R}|,

i.e. the number of choice observations in D that cannot be explained as maximization of any preference ordering ≻ ∈ R. If we restrict to sets of r orderings, the

minimal obtainable choice error is given by

Dr := min_{R ⊆ ℛ, |R| = r} E(D, R).

Say that D is r-rationalizable if Dr = 0.

Remark 12 Dataset D is 1-rationalizable if and only if it is consistent.

Remark 13 For every dataset D, there exists a constant L ≤ min{|D|, |X|} such that D is r-rationalizable for every r ≥ L.

To allow for a tradeoff between the “simplicity” of the model (as defined

through the number of orderings) and its fit to the data, I define the set of λ-best

multiple-ordering rationalizations of D as follows.

Definition 8 For any λ ∈ ℝ+, R* is a λ-BMOR of D if

R* ∈ argmin_{R ⊆ ℛ} |R| + λE(D, R). (3.2)

We can interpret λ as arbitrating between the two goals of minimizing the number

of orderings and minimizing the number of implied choice errors. Loosely speaking,


an ordering is included in R if and only if it explains at least 1/λ observations that

would otherwise be interpreted as choice error. Thus, as λ → 0, the analyst prefers

to adopt a unique ordering for the agent and interpret the remaining observations

as error, while for large choices of λ, the analyst prefers to use as many orderings

as necessary to eliminate choice error. Below, I show that the Houtman and Maks

(1985) solution is selected if λ < 1/D1 (Claim 4), and the Kalai et al. (2002) solution is

selected if λ > 1 (Claim 5).

Claim 4 If λ < 1/D1, every solution R* to Eq. (3.2) satisfies |R*| ≤ 1.

Proof 7 Suppose there exists some λ-BMOR R* satisfying |R*| = K > 1. Then by the definition of a λ-BMOR, necessarily K + λDK ≤ 1 + λD1. Since moreover 1 + λD1 < 2, it follows that K < 2 − λDK ≤ 2. This contradicts the assumption that K > 1.

Claim 5 If λ > 1, every solution R* to Eq. (3.2) satisfies |R*| = L, where L is the smallest constant such that the data is L-rationalizable.

Proof 8 Since D is L-rationalizable, clearly no solution R* to Eq. (3.2) will have |R*| > L. Suppose there exists a λ-BMOR R* with |R*| = K < L. Assign a unique ordering to each of the DK implied choice errors from rationalizing D with R*. This perfectly explains the data using K + DK orderings, and K + DK < K + λDK because λ > 1 and DK > 0, contradicting the assumption that R* is a λ-BMOR.
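The two claims can be checked numerically on the Section 3 example. A small hypothetical sketch, using the table of minimal choice errors Dk computed there (extended by Dk = 0 for k ≥ 3), minimizes k + λDk for several values of λ:

```python
# Minimal choice errors D_k for the Section 3 dataset (D_k = 0 for k >= 3).
D = {1: 11, 2: 1, 3: 0, 4: 0, 5: 0}

def bmor_size(lam):
    # size of the smallest lambda-BMOR: argmin over k of k + lam * D_k
    objective = {k: k + lam * D[k] for k in D}
    return min(k for k in D if objective[k] == min(objective.values()))

assert bmor_size(0.05) == 1     # lambda < 1/D_1 = 1/11: single ordering (Claim 4)
assert bmor_size(1.5) == 3      # lambda > 1: smallest perfect fit (Claim 5)
assert bmor_size(1 / 2.1) == 2  # an intermediate lambda selects two orderings
```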

Geometrically, solutions to (3.2) are described as follows. Let f be the linear interpolation of the points {(k, Dk) : k ∈ ℕ}, and let F = {(x, y) | y ≥ f(x)} be the epigraph of f. Then, for any choice of λ ∈ ℝ+, the problem in (3.2) returns a k-ordering solution if and only if the line with normal vector (−1, −λ) supports F at (k, Dk). For example, given the dataset provided in Section 3, the line with normal vector (−1, −λ) supports F at (2, D2) for any choice of λ ∈ (1/10, 1).



Figure 3.1: The problem in (3.2) returns a solution with 2 orderings if and only if the line with normal vector (−1, −λ) supports F at (2, D2).

How should the analyst choose λ, and under what conditions on the data-generating process does the proposed approach recover the “true” number of orderings maximized by the agent?

3.5 Recovery Results

Consider the following class of choice rules. Let F be a set of K contexts, R = {≻f}f∈F be a set of context-dependent preferences, and p ∈ [0, 1] be a probability of error. Given choice problem (A, f), suppose that the agent chooses the ≻f-optimal alternative in A with probability at least 1 − p; otherwise, he trembles and chooses a different alternative in A. Assume that the analyst does not know

1. the number of contexts.

2. the locations or number of realized errors.

3. the distribution of error.


Can he recover the true number of orderings K using the proposed approach in

(3.2)?

In general, recovery of the number of context-dependent preferences will not

be possible. My main result demonstrates, however, that the proposed method will

exactly recover K with probability exponentially close to 1 if: (1) the number of

preferences is small relative to the quantity of data, (2) the probability of error is

small, and (3) the context-dependent preferences are “sufficiently different,” in a

sense made precise in Section 5.2.

A natural next question is whether we can do better: in particular, can we

recover either the locations of mistakes, or the set of orderings themselves? In

Section 5.3, I show that even in the absence of choice error (p = 0), recovery

of multiple preferences from choice data alone is in general an ill-posed task.

Proposition 7 states that no set of three or more context-dependent orderings can be recovered, and that only special pairs of context-dependent orderings can be.

3.5.1 Class of choice models

In this section, I show that the class of choice models I consider is fairly general,

including as special cases several familiar models of choice. To describe these

relationships, recall that a random choice rule is a map P from choice problems

(A, f ) to distributions over the elements of A, with the property that the support

of P(A, f ) is a subset of A for every (A, f ). The class of choice models I consider is

equivalent to the class of random choice rules satisfying

P(x(A, f) | A, f) ≥ 1 − p for all (A, f),

where x(A, f) denotes the ≻f-optimal choice in A. The following special cases are of

interest.


Case 1: p = 0 and K = 1. This returns the classic theory of choice, in which a single preference ordering ≻ is defined over X and choice satisfies, for all (A, f) ∈ 𝒫,

P(x | A, f) = 1 if x = c≻(A), and 0 otherwise,

where the dependence on f is trivial.

Case 2: p = 0 and K ≥ 1. This returns the model proposed in Salant and Rubinstein (2008) and Bernheim and Rangel (2009), in which an agent is described by a set of context-dependent orderings {≻f}f∈F, and choice satisfies, for all (A, f) ∈ 𝒫,

P(x | A, f) = 1 if x = x(A, f), and 0 otherwise.

In the special case in which F is a partition of the set of choice sets, this is equivalent to the model studied in Kalai et al. (2002): a map h cues a context from each choice set, and choice from A is the ≻h(A)-optimal alternative in A.

Case 3: p > 0 and K = 1. If there exist distributions {qA}, one for each choice set A, such that

P(x | A, f) = qA(Rx,A) for all (A, f) ∈ 𝒫,

where Rx,A denotes the set of orderings under which x is optimal in A, then this returns the class of random utility models with choice-set dependent distributions (as considered, for example, in Fudenberg et al. (2015)). In the special case in which qA = q for every choice set A, we have the classic random utility model (Block and Marshak, 1960).

3.5.2 Can we recover the number of orderings?

I provide intuition for the subsequent results by discussing a few negative examples

in which recovery of K using the approach in (3.2) is either impossible or difficult.

In both examples, I fix p = 0 to simplify ideas.


Example 8 Define the sets of orderings

R = {(1, 2, 3), (1, 3, 2), (3, 2, 1)}, and

R′ = {(1, 2, 3), (3, 2, 1)}.

Suppose the agent’s true set of context-dependent preferences is described by R. It is

easy to verify6 that C(R) = C(R′), implying that every choice observation consistent with maximization of some ordering in R is also consistent with maximization of some ordering in R′. Then, for every choice of λ and any choice dataset D generated by (perfectly) maximizing orderings in R, we have that

|R′| + λE(D, R′) = 2 + λ · 0 < |R″| + λE(D, R″)

for every R″ consisting of three or more orderings. Thus, no solution of (3.2) will return the true number of orderings (3) in the agent’s preference.7

Example 9 Define the sets of orderings

R = {(1, . . . , 9, 10), (1, . . . , 10, 9)}, and

R′ = {(1, . . . , 9, 10)}.

Suppose the agent’s true set of context-dependent preferences is described by R. Observe

that the single ordering in R′ explains almost every choice observation that can result from maximization of orderings in R. The single exception is the observation (x10, {x9, x10}), which is consistent with maximization of the ordering (1, . . . , 10, 9), but not with maximization of the ordering (1, . . . , 9, 10). Consider any choice dataset D that is generated

6C(R) = C(R′) = {(x1, {x1, x2, x3}), (x1, {x1, x2}), (x1, {x1, x3}), (x2, {x2, x3}), (x2, {x1, x2}), (x3, {x2, x3}), (x3, {x1, x3}), (x3, {x1, x2, x3})}.

7An alternative perspective is to consider R and R′ not merely observationally equivalent, but in fact equivalent models. In this case, we can re-interpret the observation as follows: the “more complex” description R will never be selected over the “less complex” description R′.


by (perfectly) maximizing orderings in R, and does not include (x10, {x9, x10}). Then, for

any choice of λ,

|R′| + λE(D, R′) = 1 + λ · 0 < |R″| + λE(D, R″)

for every R″ consisting of two or more orderings. So observation of (x10, {x9, x10}) is

necessary to return the true number of orderings (2) using (3.2).

These examples are suggestive of the following: in order to recover the number of context-dependent orderings, there must exist “sufficient differentiation” in the choice implications of these orderings. Next, I define such a notion of differentiation.

Definition 9 Say that choice problems

𝒜 = {(A1, f1), . . . , (Ak, fk)}

are in k-violation of IIA if

1. c≻f (A) ≠ c≻f′ (A′) for every distinct (A, f), (A′, f′) ∈ 𝒜, and

2. c≻f (A) ∈ A1 ∩ · · · ∩ Ak for every (A, f) ∈ 𝒜.

Condition (1) requires that every choice problem in 𝒜 has a distinct optimal choice, and condition (2) requires that each of these (distinct) k alternatives is available in every set Ai for i = 1, . . . , k. Notice that every pair of choice problems from 𝒜 constitutes a (standard) violation of IIA.

Definition 10 The differentiation parameter dR(𝒜) of a (multi)set of choice problems 𝒜 is the largest d such that there exists a partition of 𝒜 into subsets {𝒜1, . . . , 𝒜d+1} satisfying

1. |𝒜i| = K, and

2. 𝒜i is in K-violation of IIA,

for every i ∈ {1, . . . , d}.

I illustrate this definition on an example.

Example 10 Consider a set of choice alternatives X = {x1, x2, x3, x4, x5} and a set of frames F = {fp, fq}. The agent has context-dependent preferences

R = {≻p, ≻q} = {(5, 4, 3, 2, 1), (1, 2, 3, 4, 5)}.

Suppose the agent maximizes ≻q when there are three or fewer alternatives in the choice set, and maximizes ≻p otherwise. Let 𝒜 consist of every choice problem (A, f) with A ⊆ X and f ∈ F. Then dR(𝒜) = 6, with every pair in

({x1, x2}, fq), ({x1, x2, x3, x4}, fp)
({x1, x3}, fq), ({x1, x2, x3, x5}, fp)
({x1, x4}, fq), ({x1, x2, x4, x5}, fp)
({x1, x5}, fq), ({x1, x3, x4, x5}, fp)
({x2, x3}, fq), ({x2, x3, x4, x5}, fp)
({x2, x4}, fq), ({x1, x2, x3, x4, x5}, fp)

constituting a 2-violation of IIA.

The subsequent recovery results (informally) say the following: if there is sufficient differentiation between context-dependent orderings and sufficient (not necessarily complete) sampling of choice problems, then we can recover the number of context-dependent orderings using Equation (3.2) with probability very close to 1. Theorem 3 applies to general sets of choice problems. Corollary 4 considers a particular data-generating process in which M choice problems are sampled uniformly at random from 𝒫. Throughout, I take M to be the number of observations in the data, N to be the number of alternatives, p to be the probability of error, and dR(𝒜) to be the differentiation parameter of 𝒜 given the agent’s preferences R. When there is no chance of confusion, I express dR(·) simply as d(·).

Theorem 3 Let 𝒜 be any (multi)set of M choice problems. Suppose, for some constant b > 0,

d := d(𝒜) > [(2p + b)/(1 − p)^K] M. (3.3)

Then for any δ ∈ (0, d(1 − p)^K/M − 2p − b), there exists a constant c > 1 such that the optimization problem in Eq. (3.2) with λ = 1/((p + δ)M) exactly recovers |R| = K with probability at least 1 − O(c^−M).

I provide a brief proof sketch here and defer the details to the appendix. Identify every dataset with an undirected (hyper)graph8 in the following way: nodes represent choice observations, and there is an edge between a set of observations if and only if these observations cannot be rationalized using the same preference ordering. The key observation in the proof is that a dataset is k-rationalizable if and only if the corresponding graph is k-colorable.9 This equivalence is shown by taking each color class to represent consistency with a distinct ordering. Thus, the problem in (3.2) can be seen as finding the smallest number of colors k for which the greatest number of nodes can be properly colored.

Fix any (multi)set of choice problems 𝒜 and suppose for the moment that there is no choice error. Since the data is generated by perfect maximization of K orderings, the corresponding hypergraph admits a K-coloring. Moreover, notice that every set of observations in a K-violation of IIA constitutes a complete K-partite subgraph. Since by assumption the data includes at least d such sets, the

8A hypergraph is a generalization of a graph in which edges may connect more than two vertices.

9A k-coloring of a graph is a partition of its vertex set V into k color classes such that no edge in E is monochromatic. A graph is k-colorable if it admits a k-coloring.



Figure 3.2: Studying rationalizability of a dataset is equivalent to studying colorability of a graph in which nodes represent observations and edges represent inconsistencies.

corresponding hypergraph includes at least d complete K-partite subgraphs. So it

cannot be colored by fewer than K colors.
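The rationalizability–colorability equivalence can be checked directly on a small instance. The sketch below is a hypothetical Python illustration using a three-observation choice cycle: it builds the minimal non-rationalizable subsets of observations as hyperedges, then verifies that Dr = 0 exactly when the hypergraph admits a proper r-coloring:

```python
from itertools import combinations, permutations, product

X = (1, 2, 3)
# Three observations forming a choice cycle: pairwise consistent,
# but jointly inconsistent with any single ordering.
D = [(1, frozenset({1, 2})), (2, frozenset({2, 3})), (3, frozenset({1, 3}))]
orders = list(permutations(X))

def rationalizable(obs):
    # True if a single ordering (listed best to worst) explains every observation
    return any(all(min(A, key=o.index) == x for x, A in obs) for o in orders)

# hyperedges: minimal subsets of observations that no single ordering explains
edges = [S for n in range(2, len(D) + 1) for S in combinations(D, n)
         if not rationalizable(S)
         and all(rationalizable(T) for T in combinations(S, n - 1))]

def colorable(k):
    # proper k-coloring: no hyperedge is monochromatic
    return any(all(len({assign[obs] for obs in e}) > 1 for e in edges)
               for cs in product(range(k), repeat=len(D))
               for assign in [dict(zip(D, cs))])

def D_r(r):
    # minimal choice error using r orderings (brute force)
    cover = [{(x, A) for x, A in D if min(A, key=o.index) == x} for o in orders]
    return len(D) - max(len(set().union(*c)) for c in combinations(cover, r))

for k in (1, 2, 3):
    assert (D_r(k) == 0) == colorable(k)   # k-rationalizable iff k-colorable
```

Here the cycle produces a single three-element hyperedge, so the data is 2-rationalizable but not 1-rationalizable, matching the coloring test.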

Now introduce choice error. Each node is “corrupted” with probability p, following which its edges are re-arranged (the node is removed from some edges to which it belongs, and new edges between this node and others are introduced). I show that if choice error is introduced at a sufficiently low probability, then enough complete K-partite subgraphs remain in the perturbed graph that “most” nodes can be partitioned into K (but not fewer) color classes.

The following corollary presents recovery properties for a particular, convenient choice of λ.

Corollary 3 Let 𝒜 be any (multi)set of M choice problems. Suppose p ≤ 0.05 and d(𝒜) > [0.25/(0.95)^K] M. The optimization problem in Eq. (3.2) with λ = 1/(0.1M) exactly recovers |R| = K with probability at least 1 − O(e^−0.005M).

Since the number of non-overlapping sets of size K from a set of M elements cannot exceed M/K, Condition (3.3) in Theorem 3 implies a tradeoff between the number of orderings that can be recovered and the probability of error that can be tolerated. For example, if p = 0.05 (5% probability of error) the theorem does not apply to sets with more than 6 orderings, and if p = 0.01 (1% probability of error), the theorem does not apply to sets with more than 35 orderings. In Corollary 3, the (stronger) requirement d(𝒜) > [0.25/(0.95)^K] M is satisfied only by sets including three or fewer orderings. These strict thresholds are not necessary conditions for recovery, and can be relaxed in future work.10
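The K thresholds quoted above can be checked numerically. Taking b → 0 in Condition (3.3) and using d ≤ M/K, recovery requires 2pK < (1 − p)^K, while Corollary 3's threshold similarly requires 0.25K < 0.95^K; a small hypothetical sketch:

```python
def max_K(p_factor, base):
    # largest K with p_factor * K < base ** K
    K = 0
    while p_factor * (K + 1) < base ** (K + 1):
        K += 1
    return K

# Condition (3.3) with b -> 0 and d <= M/K requires 2*p*K < (1 - p)**K:
assert max_K(2 * 0.05, 0.95) == 6    # p = 0.05: at most 6 orderings
assert max_K(2 * 0.01, 0.99) == 35   # p = 0.01: at most 35 orderings

# Corollary 3's threshold requires 0.25*K < 0.95**K:
assert max_K(0.25, 0.95) == 3        # three or fewer orderings
```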

Below, I illustrate the implications of this approach and choice of λ using a

problem of preference elicitation considered in Crawford and Pendakur (2012).

Example 11 Crawford and Pendakur (2012) study preferences over six different types of

milk, using a dataset including 500 Danish households and their purchases. Unsurprisingly,

no single utility function can explain all 500 observations. To accommodate heterogeneity

in preference, Crawford and Pendakur (2012) suggest an application of Kalai et al. (2002),

in which the minimal number of utility functions that explain all of the data is found. They

find that in fact 4-5 utility functions are sufficient to explain all of the data. This is a perfect

multiple-preference fit to the data.

But they further comment that the fifth utility function explains only 8 out of 500

observations, and drop this utility function in many of their later analyses. This highlights

a limitation of the approach proposed in Kalai et al. (2002): the approach ignores variation

in the strength of evidence for recovered preferences.

We can extend the approach in Kalai et al. (2002) by finding a “best” imperfect multiple-preference fit. Under the choice of λ = 1/(0.1 · 500) = 1/50 proposed in Corollary 3, preferences exist in a best multiple-ordering rationalization only if they uniquely explain at least 50

10One way to relax this restriction is to count the number of possibly overlapping sets of choice problems that constitute a K-violation of IIA.


observations. The fifth utility function does not satisfy this criterion, so the remaining 8

observations are interpreted as choice error.11

Corollary 4 considers a related context in which the set of choice problems is

not fixed by the analyst, but generated by uniform sampling over the set of possible

choice problems 𝒫 = {(A, f) : A ⊆ X, f ∈ F}. A similar result obtains provided the differentiation parameter d(𝒫) is sufficiently high.

Corollary 4 Suppose 𝒜 consists of M choice problems sampled uniformly at random from 𝒫, and

d := d(𝒫) > [(2p + b)/(1 − p)^K] (2^N K) (3.4)

for some constant b > 0. Then for any δ ∈ (0, d(1 − p)^K/M − 2p − b), there is a constant c > 1 such that the optimization problem in Eq. (3.2) with λ = 1/((p + δ)M) exactly recovers |R| = K with probability at least 1 − O(c^−M).

The details of the proof are deferred to the appendix.

3.5.3 Can we recover more?

Section 5.2 provides conditions under which the problem in (3.2) recovers the

correct number of context-dependent orderings with high probability. Is it possible

to recover the context-dependent orderings themselves? First, I show that even in

the absence of choice error (p = 0), recovery of multiple preferences from choice

data is in general an ill-posed task. From Proposition 7, no set of three or more

context-dependent orderings can be uniquely recovered, and only special pairs of

context-dependent orderings can be recovered.

11It is important to note, however, that Crawford and Pendakur (2012) consider utility functions defined on the continuous space of price-quantity pairs, whereas my recovery results in Section 5 pertain to orderings defined on finite sets of alternatives.


Next, I discuss the possibility of identifying an equivalence class (in choice

implications) containing the true set of context-dependent orderings. I provide an

example to illustrate that even this weaker notion of identifiability is not met. I

suggest that a more nuanced notion of complexity than cardinality is needed for

recovery of sets of context-dependent orderings, and conclude with preliminary

comments toward extension.

In the following, I say that R is identified if there exists data D such that

{R} = argmin_{R′ ⊆ ℛ} { |R′| : E(D, R′) = 0 }. (3.5)

That is, there exists some set of choice observations (possibly including observation

of multiple choices from the same choice set) such that R is the unique set of k ≤ |R| orderings that perfectly explains the data. We can think of the data as “revealing” R

to the data analyst. Otherwise, say that R is not identified. The following observation

provides an equivalent characterization.

Observation 4 R is identified if and only if there does not exist R′ ≠ R satisfying |R′| ≤ |R| and E(C(R), R′) = 0.

Thus, if there exists any data which identifies R, then the dataset C(R) will

identify R.

Example 12 Fix X = {x1, x2, x3} and R = {(3, 2, 1), (3, 1, 2)}. The choice implications of R,

C(R) = {(x1, {x1, x2}), (x1, {x1, x3}), (x2, {x2, x3}), (x3, {x2, x3}), (x1, {x1, x2, x3})},


is a subset of the choice implications of R′, for each

R′ ∈ {{(3, 2, 1), (2, 1, 3)}, {(3, 2, 1), (1, 2, 3)}, {(2, 3, 1), (3, 1, 2)}, {(1, 3, 2), (3, 1, 2)}}.

So every dataset that can be perfectly rationalized by the orderings in R can be perfectly rationalized by the orderings in one of these sets R′ ≠ R. Therefore, R is not identified.
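This containment can be verified mechanically. The sketch below is a hypothetical Python illustration: it assumes each ordering is encoded as a rank vector, with r[i−1] the rank of xi and higher ranks preferred (the encoding that reproduces the C(R) displayed above), and checks that C(R) is contained in the choice implications of each alternative pair:

```python
from itertools import combinations

def best(r, A):
    # r is a rank vector: r[i-1] is the rank of alternative i (higher = better)
    return max(A, key=lambda i: r[i - 1])

def C(R, X=(1, 2, 3)):
    # choice implications: all observations explainable by some ordering in R
    sets = [frozenset(s) for n in (2, 3) for s in combinations(X, n)]
    return {(best(r, A), A) for r in R for A in sets}

R = [(3, 2, 1), (3, 1, 2)]
assert len(C(R)) == 5   # the five observations displayed in Example 12

alternatives = [[(3, 2, 1), (2, 1, 3)], [(3, 2, 1), (1, 2, 3)],
                [(2, 3, 1), (3, 1, 2)], [(1, 3, 2), (3, 1, 2)]]

# Every dataset explained by R is explained by each alternative pair,
# so criterion (3.5) cannot single out R: R is not identified.
assert all(C(R) <= C(Rp) for Rp in alternatives)
```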

The following proposition establishes non-identifiability of most non-singleton

sets. Specifically, every set with at least three orderings is not identified, and a

pair of orderings is identified if and only if the orderings differ at extremes (the

maximal element in the first ordering is the minimal element in the second, and

vice versa).

Proposition 7 If |R| ≥ 3, then R is not identified. If R = {≻1, ≻2}, then R is identified if and only if argmax_i r_i^j = argmin_i r_i^{3−j} for j = 1, 2, where r^j is the rank vector of ≻j. Every singleton R = {≻} is identified.

The proof is deferred to the appendix. Proposition 7 leaves open the possibility

of recovering an equivalence class (in choice implications) containing the true set

of orderings. Say that R ∼ R′ if C(R) = C(R′), and let

[R] = {R′ ⊆ ℛ | R′ ∼ R}

be the equivalence class of R induced by ∼. Every set of orderings R′ ∈ [R] has the same choice implications as R, so that an observation can be rationalized using an ordering in R if and only if it can be rationalized using an ordering in R′. Say that

R is choice-identified if there exists data D such that

[R] = argmin_{R′ ⊆ ℛ} { |R′| : E(D, R′) = 0 }.

That is, there exists some set of choice observations D such that every observation in D is consistent with maximization of some ordering in R′ if and only if C(R′) = C(R).

The following negative example shows that even this weaker kind of identifiability

need not be satisfied by typical sets of orderings.

Example 13 Define

R = {(1, 2, 3, 4, 5), (2, 1, 3, 4, 5)}, and

R′ = {(1, 2, 3, 4, 5), (5, 4, 3, 2, 1)}.

Every choice observation consistent with maximization of some ordering in R is also

consistent with maximization of some ordering in R′, but the converse does not hold. So C(R) ⊂ C(R′), implying both that R′ ∉ [R], and also that every dataset that can be perfectly rationalized using the orderings in R can be perfectly rationalized using the orderings in R′. It follows that R is not choice-identified.

The example above highlights the (general) phenomenon that across sets with a

fixed number of orderings, there is variation in the “richness” of choice implications.

Since the approach in (3.2) penalizes all sets consisting of the same number of

orderings equally, it is biased towards elicitation of sets with richer choice implica-

tions. That is, if the decision maker’s context-dependent preferences are many but

similar, the proposed approach will incorrectly interpret the data using orderings

that are fewer but “more different”. It may be possible to recover equivalence

classes by extending the approach in (3.2) to loss functions of the form

f(R) + λE(D, R),

where f penalizes the “richness” or “expressiveness” of the orderings in R, instead

of the cardinality. I leave this extension for future work.


3.6 Relationship to Literature

This paper extends ideas in Kalai et al. (2002), which defines a set of orderings

{≻1, . . . , ≻L} as a rationalization by multiple rationales of a choice function c if, for

every choice set A, the selected alternative c(A) is ≻i-maximal in A for some

i = 1, 2, . . . , L. Using the notation of Section 2, any set of orderings R with choice

error E(D, R) = 0 is a rationalization by multiple rationales of the dataset D.

This set of orderings may not, however, correspond to a best multiple-ordering

rationalization of the data as defined in (3.2). In particular, I suggest that the analyst

may prefer an imperfect rationalization of the data using some K < L orderings to

perfect rationalization of the data using L orderings. The key conceptual difference

is that Kalai et al. (2002) is agnostic towards the “degree of evidence” for any

particular ordering ≻k, whereas the approach in this paper insists on sufficient

evidence for each ordering in order to separate error from preference variation.

The model of choice that I consider throughout is an extension of the frame-dependent preferences proposed independently in Salant and Rubinstein (2008) and Bernheim and Rangel (2009), with the addition of choice error. In each of these papers, the standard model is enriched by a set F of contexts.12 A choice problem13 is defined as a pair (A, f), where A ⊆ X is a choice set and f ∈ F is a context. An extended choice function c assigns to every extended choice problem (A, f) an element of A. I consider an extension of this model that allows for a probability of error, so that c(A, f) is chosen with probability at least 1 − p, but with the remaining probability the agent trembles.

Although my model of choice is very similar, the goals of this paper are very

12Frames in Salant and Rubinstein (2008), and ancillary conditions in Bernheim and Rangel (2009).

13Extended choice problem in Salant and Rubinstein (2008), and generalized choice situation in Bernheim and Rangel (2009).


different. Salant and Rubinstein (2008) characterizes the choice correspondence

Cc(A) = {x | c(A, f) = x for some f ∈ F}, and Bernheim and Rangel (2009) proposes a framework for welfare assessment. This paper studies the question of

whether it is possible to recover the number of contexts in F, using choice data

alone. My results in Section 5 show that recovery of context-dependent preferences

given the model proposed in Salant and Rubinstein (2008) and Bernheim and

Rangel (2009) is an ill-posed problem, but recovery of the number of contexts is

possible even under choice error.

This paper is related also to Ambrus and Rozen (2013), which shows that

without prior restriction on the number of selves involved in a decision, many

multi-self models have no testable implications. Although the set of choice models

considered in Ambrus and Rozen (2013) is distinct from the set of choice models

considered in my paper,14 their lesson that restricting the number of selves is

important for constraining the available degrees of freedom holds in my domain

as well, and motivates in part the suggested criterion in (3.2).

Finally, the applications that I suggest are related to exercises undertaken in

Crawford and Pendakur (2012) and Dean and Martin (2010), which respectively

apply the approaches of Kalai et al. (2002) and Houtman and Maks (1985) to

interpret inconsistent choice data.

Conclusion

Inconsistencies in choice data may emerge either from choice error or from maximization of multiple orderings. It is important to separate the two in the analysis of

14Ambrus and Rozen (2013) study multi-self models in which every self is active in every decision, and choice is determined through maximization of a choice-set independent aggregation rule over selves. In contrast, I study multi-self models in which every self acts as a “dictator” in a subset of choices, thus varying the aggregation rule across choice problems.


the data, since their implications for welfare assessment and prediction are very

different. But how does the analyst know how many distinct orderings are being

maximized? This paper suggests use of statistical regularization to recover a small

number of context-dependent preferences from noisy choice data. I show that with

probability exponentially close to 1, the proposed approach is able to recover the

true number of context-dependent preferences. This provides an alternative to

existing approaches, which deliver either a single “best-fit” ordering or multiple “perfect-fit” orderings.


References

Acemoglu, D., Chernozhukov, V. and Yildiz, M. (2015). Fragility of asymptotic agreement under Bayesian learning. Theoretical Economics.

Al-Najjar, N. (2009). Decision makers as statisticians: Diversity, ambiguity, and learning. Econometrica.

Ambrus, A. and Rozen, K. (2013). Rationalizing choice with multi-self models. Economic Journal.

Aumann, R. J. (1976). Agreeing to disagree. The Annals of Statistics.

Bar-Hillel, M. and Wagenaar, W. (1991). The perception of randomness. Advances in Applied Mathematics.

Barberis, N., Shleifer, A. and Vishny, R. (1998). A model of investor sentiment. Journal of Financial Economics.

Battigalli, P. and Siniscalchi, M. (2003). Rationalization and incomplete information. Advances in Theoretical Economics.

Belloni, A. and Chernozhukov, V. (2011). L1-penalized quantile regression in high-dimensional sparse models. Annals of Statistics.

—, —, Chen, D. and Hansen, C. (2012). Sparse models and methods for instrumental regression, with an application to eminent domain. Econometrica.

—, — and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika.

Bernheim, B. D. and Rangel, A. (2009). Beyond revealed preference: Choice-theoretic foundations for behavioral welfare economics. Quarterly Journal of Economics.

Beshears, J., Choi, J., Laibson, D. and Madrian, B. (2008). The Importance of Default Options for Retirement Saving Outcomes: Evidence from the United States. Oxford University Press, pp. 59–87.


Billot, A., Gilboa, I., Samet, D. and Schmeidler, D. (2005). Probabilities as similarity-weighted frequencies. Econometrica.

Block, H. and Marshak, J. (1960). Random orderings and stochastic theories of response. Contributions to Probability and Statistics.

Brandenburger, A. and Dekel, E. (1993). Hierarchies of belief and common knowledge. Journal of Economic Theory.

Camerer, C. (1989). Does the basketball market believe in the ‘hot hand’? American Economic Review.

Carlin, B. I., Kogan, S. and Lowery, R. (2013). Trading complex assets. The Journal of Finance.

—, Longstaff, F. A. and Matoba, K. (2014). Disagreement and asset prices. Journal of Financial Economics.

Carlsson, H. and van Damme, E. (1993). Global games and equilibrium selection. Econometrica, 61 (5), 989–1018.

Chen, D., Moskowitz, T. and Shue, K. (). Decision-making under the gambler’s fallacy: Evidence from asylum judges, loan officers, and baseball umpires. Working paper.

Chen, Y.-C., di Tillio, A., Faingold, E. and Xiong, S. (2010). Uniform topologies on types. Theoretical Economics.

Crawford, I. and Pendakur, K. (2012). How many types are there? Economic Journal.

Cripps, M., Ely, J., Mailath, G. and Samuelson, L. (2008). Common learning. Econometrica.

Croson, R. and Sundali, J. (2005). The gambler’s fallacy and the hot hand: Empirical data from casinos. Journal of Risk and Uncertainty.

Dean, M. and Martin, D. (2010). How consistent are your choice data? Working paper.

Dekel, E., Fudenberg, D. and Levine, D. (2004). Learning to play Bayesian games. Games and Economic Behavior.

—, — and Morris, S. (2006). Topologies on types. Theoretical Economics.

—, — and — (2007). Interim correlated rationalizability. Theoretical Economics.

Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory.


Edwards, W. (). Probability learning in 1000 trials. Journal of Experimental Psychology.

Einav, L., Finkelstein, A., Pascu, I. and Cullen, M. (2012). How general are risk preferences? Choices under uncertainty in different domains. American Economic Review.

Elad, M. and Bruckstein, A. (2002). A generalized uncertainty principle and sparse representation. IEEE Transactions on Information Theory.

Esponda, I. (2013). Rationalizable conjectural equilibrium: A framework for robust predictions. Theoretical Economics.

Eyster, E. and Piccione, M. (2013). An approach to asset pricing under incomplete and diverse preferences. Econometrica.

Falk, R. and Konold, C. (1997). Making sense of randomness: Implicit encoding as a basis for judgment. Psychological Review.

Fudenberg, D., Iijima, R. and Strzalecki, T. (2015). Stochastic choice and revealed perturbed utility. Econometrica.

—, Kreps, D. and Levine, D. (1988). On the robustness of equilibrium refinements. Journal of Economic Theory.

Gabaix, X. (2014). A sparsity-based model of bounded rationality, with application to basic consumer and equilibrium theory. Quarterly Journal of Economics.

Gayer, G., Gilboa, I. and Lieberman, O. (2007). Rule-based and case-based reasoning in housing prices. The B.E. Journal of Theoretical Economics.

Geanakoplos, J. and Polemarchakis, H. (1982). We can’t disagree forever. Journal of Economic Theory.

Gilboa, I., Lieberman, O. and Schmeidler, D. (2006). Empirical similarity. Review of Economics and Statistics.

—, Samuelson, L. and Schmeidler, D. (2013). Dynamics of inductive inference in a unified framework. Journal of Economic Theory.

— and Schmeidler, D. (2003). Inductive inference: An axiomatic approach. Econometrica.

Gilovich, T., Vallone, R. and Tversky, A. (1985). The hot hand in basketball: On the misperception of random sequences. Cognitive Psychology.

Hansen, L. P. (2014). Uncertainty inside and outside economic models. Journal of Political Economy.


— and Sargent, T. (2010). Fragile beliefs and the price of uncertainty. Quantitative Economics.

— and — (2012). Three types of ambiguity. Journal of Monetary Economics.

— and Sargent, T. J. (2007). Robustness. Princeton University Press.

Houtman, M. and Maks, J. (1985). Determining all maximal data subsets consistent with revealed preference. Kwantitatieve Methoden, 19, 89–104.

Huang, P.-S., Chen, S. D., Smaragdis, P. and Hasegawa-Johnson, M. (2013). Singing-voice separation from monaural recordings using robust principal component analysis. International Conference on Acoustics, Speech, and Signal Processing.

Huber, J., Payne, J. and Puto, C. (1982). Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research.

Jackson, M. O., Kalai, E. and Smorodinsky, R. (1999). Bayesian representation of stochastic processes under learning: de Finetti revisited. Econometrica.

Kahneman, D. and Tversky, A. (2000). Choices, Values, and Frames. The Press Syndicate of the University of Cambridge.

Kajii, A. and Morris, S. (1997). The robustness of equilibria to incomplete information. Econometrica.

Kalai, G., Rubinstein, A. and Spiegler, R. (2002). Rationalizing choice functions by multiple rationales. Econometrica, 70 (6), 2481–2488.

Kandel, E. and Pearson, N. (1995). Differential interpretation of information and trade in speculative markets. Journal of Political Economy.

Kohlberg, E. and Mertens, J.-F. (1986). On the strategic stability of equilibria. Econometrica.

Mankiw, G., Reis, R. and Wolfers, J. (2004). Disagreement about inflation expectations. NBER Macroeconomics Annual 2003.

Marlon, J., Leiserowitz, A. and Feinberg, G. (2013). Scientific and Public Perspectives on Climate Change. Tech. rep., Yale Project on Climate Change Communication.

Mertens, J.-F. and Zamir, S. (1985). Formulation of Bayesian analysis for games with incomplete information. International Journal of Game Theory.

Monderer, D. and Samet, D. (1989). Approximating common knowledge with common beliefs. Games and Economic Behavior.


Morris, S., Rob, R. and Shin, H. S. (1995). p-dominance and belief potential. Econometrica.

— and Takahashi, S. (). Strategic implications of almost common certainty of payoffs.

—, — and Tercieux, O. (2012). Robust rationalizability under almost common certainty of payoffs. The Japanese Economic Review.

Myerson, R. B. (1978). Refinements of the Nash equilibrium concept. International Journal of Game Theory.

Nickerson, R. and Butler, S. (2009). On producing random sequences. American Journal of Psychology.

Peysakhovich, A. and Naecker, J. (2016). Evaluating models of choice under risk and ambiguity using methods from machine learning. Working paper.

Rabin, M. (2002). Inference by believers in the law of small numbers. The Quarterly Journal of Economics.

— and Vayanos, D. (2010). The gambler’s and hot-hand fallacies: Theory and applications. Review of Economic Studies.

Rapoport, A. and Budescu, D. (1997). Randomization in individual choice behavior. Psychological Review.

Rath, G. (1966). Randomization by humans. The American Journal of Psychology.

Ren, X., Zhang, Z. and Ma, Y. (2012). Repairing sparse low-rank texture. Journal of LaTeX Class Files.

Rubinstein, A. (1989). The electronic mail game: Strategic behavior under "almost common knowledge". American Economic Review, 79.

Salant, Y. and Rubinstein, A. (2008). (A, f): Choice with frames. The Review of Economic Studies.

Selten, R. (1975). Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory.

Steiner, J. and Stewart, C. (2008). Contagion through learning. Theoretical Economics.

Tversky, A. and Kahneman, D. (1971). The belief in the law of small numbers. Psychological Bulletin.

Wagenaar, W. (1972). Generation of random sequences by human subjects: A critical survey of the literature. Psychological Bulletin.


Weinstein, J. and Yildiz, M. (2007). A structure theorem for rationalizability with application to robust prediction of refinements. Econometrica.

Weigel, D. (2009). How many southern whites believe Obama was born in America? Washington Independent.

Yang, A., Zhou, Z., Ganesh, A., Sastry, S. S. and Ma, Y. (2013). Fast l1-minimization algorithms for robust face recognition. Computer Vision and Pattern Recognition.

Yu, J. (2011). Disagreement and return predictability of stock portfolios. Journal of Financial Economics.


Appendix A

Appendix to Chapter 1

A.1 Notation and Preliminaries

• If $(X, d)$ is a metric space with $A \subseteq X$ and $x \in X$, I write
$$d(A, x) = \sup_{x' \in A} d(x', x).$$

• $\mathrm{Int}(A)$ is used for the interior of the set $A$.

• Recall that $u \in \mathcal{U}$ is a payoff matrix. For clarity, I will sometimes write $u_i$ to denote the payoffs in $u$ corresponding to agent $i$, and $u(a, \theta)$ to denote $g(\theta)(a)$.

• For any $\mu, \nu \in \Delta(\Theta)$, the Wasserstein distance is given by
$$W_1(\mu, \nu) = \inf \mathbb{E}[d_0(X, Y)],$$
where the expectation is taken with respect to a $\Theta \times \Theta$-valued random variable $(X, Y)$, and the infimum is taken over all joint distributions of $(X, Y)$ with marginals $\mu$ and $\nu$ respectively.
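Since the results below lean on the Kantorovich-Rubinstein duality and on comparisons between $W_1$ and the Prokhorov metric, a small numerical illustration may help. The sketch below (the helper name is mine, not the text's) computes $W_1$ for distributions on finitely many points of the real line, where it reduces to the area between the two CDFs:

```python
# Illustration only: W1 on a finite, sorted 1-D support, where it equals
# the area between the two CDFs. Helper and variable names are mine.
def w1_discrete(support, p, q):
    total, Fp, Fq = 0.0, 0.0, 0.0
    for k in range(len(support) - 1):
        Fp += p[k]
        Fq += q[k]
        total += abs(Fp - Fq) * (support[k + 1] - support[k])
    return total

support = [0.0, 1.0, 2.0]
mu      = [0.5, 0.5, 0.0]   # mass split between 0 and 1
nu      = [0.0, 0.0, 1.0]   # point mass at 2 (a Dirac delta)

print(w1_discrete(support, mu, nu))   # 1.5: move 0.5 mass by 2 and 0.5 mass by 1
```

The convexity property exploited in Lemma 5 can be checked in the same way: the distance from $\alpha\mu + (1-\alpha)\mu'$ to a fixed target never exceeds the corresponding convex combination of distances.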


A.2 Preliminary Results

Lemma 4 The function
$$h(\mu) = \int_\Theta g(\theta)\, d\mu \qquad \forall\, \mu \in \Delta(\Theta)$$
is continuous.

Proof 9 By assumption, $g$ is Lipschitz continuous; let $K < \infty$ be its Lipschitz constant (assuming the sup-metric on $\mathcal{U}$). Suppose $d_P(\mu, \mu') \le \epsilon$; then,
$$\|h(\mu) - h(\mu')\|_\infty = \left\| \int_\Theta g(\theta)\, d(\mu - \mu') \right\|_\infty \le K \sup_{f \in BL_1(\Theta)} \int_\Theta f(\theta)\, d(\mu - \mu') = K\, W_1(\mu, \mu') \le K(\mathrm{diam}(\Theta) + 1)\, d_P(\mu, \mu') \le K(\mathrm{diam}(\Theta) + 1)\, \epsilon,$$
using the assumption of Lipschitz continuity in the first inequality, and compactness of $\Theta$ and the Kantorovich-Rubinstein dual representation of $W_1$ in the following equality. The second inequality follows from Theorem 2 in [metric]. So $h$ is continuous.

Lemma 5 If $d_P(F_{Z^n}, \delta_{\theta^*}) \to 0$ a.s., then also
$$d_P\big(\mathrm{Conv}(F_{Z^n}),\, \delta_{\theta^*}\big) \to 0 \quad \text{a.s.},$$
where $\mathrm{Conv}(F_{Z^n})$ denotes the convex hull of $F_{Z^n}$.

Proof 10 Fix any dataset $z^n$, constant $\alpha \in [0, 1]$, and measures $\mu, \mu' \in F_{z^n}$. Again using the dual representation,
$$\begin{aligned}
W_1(\alpha\mu + (1-\alpha)\mu',\, \delta_{\theta^*}) &= \sup_{f \in BL_1(\Theta)} \int f(\theta)\, d\big(\alpha\mu + (1-\alpha)\mu' - \delta_{\theta^*}\big) \\
&= \sup_{f \in BL_1(\Theta)} \left[ \alpha \int f(\theta)\, d(\mu - \delta_{\theta^*}) + (1-\alpha) \int f(\theta)\, d(\mu' - \delta_{\theta^*}) \right] \\
&\le \alpha \sup_{f \in BL_1(\Theta)} \int f(\theta)\, d(\mu - \delta_{\theta^*}) + (1-\alpha) \sup_{f \in BL_1(\Theta)} \int f(\theta)\, d(\mu' - \delta_{\theta^*}) \\
&= \alpha\, W_1(\mu, \delta_{\theta^*}) + (1-\alpha)\, W_1(\mu', \delta_{\theta^*}) \le \sup_{\mu \in F_{z^n}} W_1(\mu, \delta_{\theta^*}).
\end{aligned}$$
Moreover, using Theorem 2 in [metric],
$$d_P\big(\alpha\mu + (1-\alpha)\mu',\, \delta_{\theta^*}\big)^2 \le W_1\big(\alpha\mu + (1-\alpha)\mu',\, \delta_{\theta^*}\big),$$
and also
$$\sup_{\mu \in F_{z^n}} W_1(\mu, \delta_{\theta^*}) \le (1 + \mathrm{diam}(\Theta)) \sup_{\mu \in F_{z^n}} d_P(\mu, \delta_{\theta^*}).$$
Thus, for every dataset $z^n$,
$$d_P\big(\mathrm{Conv}(F_{z^n}),\, \delta_{\theta^*}\big)^2 \le (1 + \mathrm{diam}(\Theta)) \sup_{\mu \in F_{z^n}} d_P(\mu, \delta_{\theta^*}),$$
where $\mathrm{diam}(\Theta)$ is finite by compactness of $\Theta$. So $d_P(F_{Z^n}, \delta_{\theta^*}) \to 0$ a.s. implies $d_P(\mathrm{Conv}(F_{Z^n}), \delta_{\theta^*}) \to 0$ a.s., as desired.

Claim 6 Fix any agent $i$, and let $t_{\theta^*}$ be the type with common certainty in $\theta^*$. If action $a_i$ is strongly strict-rationalizable for agent $i$ with type $t_{\theta^*}$, then it is also weakly strict-rationalizable for agent $i$ in the complete information game with payoffs $u^* = g(\theta^*)$.

Proof 11 By induction. Trivially $R^1_j[t_{\theta^*}] = W^1_j = A_j$ for every agent $j$. If $a_j \notin W^2_j$, then it is not a strict best response to any distribution over opponent actions, so also $a_j \notin R^2_j[t_{\theta^*}]$. Thus,
$$R^2_j[t_{\theta^*}] \subseteq W^2_j \qquad \forall\, j.$$
Now, suppose $R^k_j[t_{\theta^*}] \subseteq W^k_j$ for every agent $j$, and consider any agent $i$ and action $a_i \in R^{k+1}_i[t_{\theta^*}]$. By construction of the set $R^{k+1}_i[t_{\theta^*}]$, there exists some distribution $\pi$ with $\mathrm{marg}_{\Theta \times T_{-i}}\, \pi = \kappa_{t_{\theta^*}}$ and $\pi\big(a_{-i} \in R^k_{-i}[t_{-i}]\big) = 1$ such that
$$\int_{\Theta \times T_{-i} \times A_{-i}} u_i(a_i, a_{-i}, \theta)\, d\pi > \int_{\Theta \times T_{-i} \times A_{-i}} u_i(a'_i, a_{-i}, \theta)\, d\pi + \delta \qquad \forall\, a'_i \ne a_i.$$
But since $R^k_{-i}[t_{-i}] \subseteq W^k_{-i}$, the distribution $\pi$ also satisfies $\pi\big(a_{-i} \in W^k_{-i}\big) = 1$. So $a_i$ is a $\delta$-best response to some distribution $\pi$ with support in the surviving set of weakly strict-rationalizable actions, implying that $a_i \in W^{k+1}_i$, as desired.

A.3 Appendix C: Main Results

A.3.1 Proof of Claim 1

I use the following notation. For every dataset $z^n = \{(x_k, p(x_k))\}_{k=1}^n$, define
$$F(z^n) = \left\{ \hat{p}(0) : \hat{p} \in \mathcal{P} \text{ and } \hat{p}(x_k) = p(x_k)\ \forall\, k = 1, \dots, n \right\}$$
and let $T_{z^n}$ be the set of hierarchies of belief with common certainty in $F(z^n)$. (See footnote 8 for the definition of $\mathcal{P}$.) Also, let $t_{-1}$ be the type with common certainty in $-1$, and let $t_1$ be the type with common certainty in $1$. Observe that $R$ is rationalizable for type $t_1$ and not for type $t_{-1}$.

Suppose $F(z^n) = \{-1, 1\}$. Then $t_{-1} \in T_{z^n}$, so there is a type in $T_{z^n}$ for whom $R$ is not rationalizable. Now suppose $F(z^n) = \{1\}$. Then, the only permitted type is $t_1$, so $R$ is trivially rationalizable for every type in $T_{z^n}$. It follows that $R$ is rationalizable for every type in $T_{z^n}$ if and only if $F(z^n) = \{1\}$; that is, if and only if every inference rule $\hat{p} \in \mathcal{P}$ that exactly fits $z^n$ predicts $\hat{p}(0) = 1$. For what datasets $z^n$ does this hold?

We can reduce this problem by looking at whether the smallest hyper-rectangle that contains every successful observation also contains the origin. This will be the case if and only if for every dimension $k$, there exist observations $(x_i, 1)$ and $(x_j, 1)$ such that $x^k_i < 0$ and $x^k_j > 0$ (that is, the $k$-th attribute is negative in some observed high-yield region, and positive in some observed high-yield region). For every $k$, this probability is
$$1 - \left[ 2\left(\frac{2c - c_0}{2c}\right)^n - \left(\frac{c - c_0}{c}\right)^n \right].$$
Realizations of the $k$-th attribute are independent across dimensions. Thus, the probability that this holds for every dimension is
$$\left( 1 - \left[ 2\left(\frac{2c - c_0}{2c}\right)^n - \left(\frac{c - c_0}{c}\right)^n \right] \right)^r$$
as desired.
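As a sanity check on this expression, the per-dimension probability can be evaluated directly; the function below (its name and argument labels are mine) simply evaluates the displayed formula. Note that at $n = 1$ the bracketed term equals $1$, so the probability is $0$: a single observation cannot have the $k$-th attribute both negative and positive.

```python
# Evaluates the closed-form probability displayed above; the function and
# argument names are my own labels for the constants n, r, c, c0 in the text.
def prob_origin_covered(n, r, c, c0):
    per_dim = 1.0 - (2.0 * ((2.0 * c - c0) / (2.0 * c)) ** n
                     - ((c - c0) / c) ** n)
    return per_dim ** r

print(prob_origin_covered(1, 2, 1.0, 0.5))   # 0.0: one observation never suffices
```

The probability rises toward 1 as $n$ grows, consistent with the interpretation that enough successful observations eventually pin down the prediction at the origin.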

A.3.2 Proof of Proposition 1

The proof of this proposition follows from two lemmas. The first is a straightforward generalization of Proposition 6 in [faingold],¹ and relates common learning to convergence of types in the uniform-weak topology. The second lemma says that for every dataset $z$, the distance between $t^i_z$ and $t_{\theta^*}$ is bounded above by $d_P(F_z, \delta_{\theta^*})$.

Throughout, I use $t_{\theta^*}$ to denote the type with common certainty in $\theta^*$.

Lemma 6 Agent $i$ commonly learns $\theta^*$ if and only if
$$d^{UW}_i\big(t^i_{Z^n}, t_{\theta^*}\big) \to 0 \quad \text{a.s. as } n \to \infty.$$

¹This lemma appears in [faingold] for the case in which $\Theta$ is a finite set and $d_0$ is the discrete metric, but generalizes to any complete and separable metric space $(\Theta, d_0)$ when the definition of common learning is replaced by Definition 2.

Thus, the problem of determining whether an agent $i$ commonly learns $\theta^*$ is equivalent to that of determining whether his random type $t^i_{Z^n}$ almost surely converges to $t_{\theta^*}$ in the uniform-weak topology.

Lemma 7 For every dataset $z$,
$$d^{UW}_i\big(t^i_z, t_{\theta^*}\big) \le d_P(F_z, \delta_{\theta^*}). \tag{A.1}$$

Proof 12 Fix any dataset $z$. It is useful to decompose the set of types $T_z$ into the Cartesian product $\prod_{k=1}^\infty H^k_z$, where $H^1_z = F_z$ and for each $k > 1$, $H^k_z$ is recursively defined
$$H^k_z = \left\{ t^k \in T^k : \big(\mathrm{marg}_{T^{k-1}}\, t^k\big)(H^{k-1}_z) = 1 \text{ and } \mathrm{marg}_\Theta\, t^k \in H^1_z \right\}; \tag{A.2}$$
that is, $H^k_z$ consists of the $k$-th order beliefs of types in $T_z$. First, I show that every $k$-th order belief in the set $H^k_z$ is within $d_P(F_z, \delta_{\theta^*})$ (in the $d^k$ metric²) of the $k$-th order belief of $t_{\theta^*}$.

²See Section 3.2.

Claim 7 Define $d^* = d_P(F_z, \delta_{\theta^*})$. For every $k \ge 1$,
$$H^k_z \subseteq \big\{ t^k_{\theta^*} \big\}^{d^*} := \left\{ t^k \in T^k : d^k\big(t^k, t^k_{\theta^*}\big) \le d^* \right\}.$$

Proof 13 Fix any $t \in T_z$. By construction of $T_z$, the first-order belief of type $t$ is in the set $F_z$. So it is immediate that
$$d^1(t, t_{\theta^*}) \le d_P(F_z, \delta_{\theta^*}) = d^*. \tag{A.3}$$
Now suppose $H^k_z \subseteq \{t^k_{\theta^*}\}^{d^*}$. Then, since $t^{k+1}\big(\{t^k_{\theta^*}\}^{d^*}\big) \ge t^{k+1}(H^k_z) = 1$ from (A.2), and $t^{k+1}_{\theta^*}\big(\{t^k_{\theta^*}\}\big) = 1$ by definition of the type $t_{\theta^*}$, it follows that
$$t^{k+1}_{\theta^*}(E) \le t^{k+1}\big(E^{d^*}\big) + d^*$$
for every measurable $E \subseteq T^k$. Using this and (A.3),
$$d^{k+1}(t, t_{\theta^*}) \le d^*, \tag{A.4}$$
as desired.

So $d^k(t, t_{\theta^*}) \le d^*$ for every $k$, implying $d^{UW}_i(t, t_{\theta^*}) = \sup_{k \ge 1} d^k(t, t_{\theta^*}) \le d^*$.

Thus, the question of convergence of types is reduced to the question of convergence in distributions over $\Theta$. The remainder of the argument is now completed:

Fix any map $t^i : z \mapsto t^i_z$ such that $t^i_z \in T_z$ for every $z$. Suppose $M$ is uniformly consistent; then $\sup_{\mu \in M} d(\mu_{Z^n}, \delta_{\theta^*}) \to 0$ a.s.³ It follows from Lemma 7 that
$$d^{UW}_i\big(t^i_{Z^n}, t_{\theta^*}\big) \to 0 \quad \text{a.s.},$$
so that agent $i$'s (interim) type $t^i_{Z^n}$ almost surely converges to $t_{\theta^*}$. Using Lemma 6, agent $i$ commonly learns $\theta^*$.

³Uniform convergence in $W_1$ implies uniform convergence in the Prokhorov metric $d$. See for example [metric].

For the other direction, suppose $M$ is not uniformly consistent. Then, there exist constants $\epsilon, \delta > 0$ such that for $n$ sufficiently large,
$$\sup_{\mu \in M} d\big(\mu(z^n), \delta_{\theta^*}\big) > \epsilon \tag{A.5}$$
for every $z^n$ in a set $Z^*_n$ of $P^n$-measure $\delta$. Define the map $t^i$ such that for every dataset $z^n \in Z^*_n$, agent $i$'s first-order belief is $\mu(z^n)$ for some $\mu \in M$ satisfying $d(\mu(z^n), \delta_{\theta^*}) > \epsilon$ (existence guaranteed by (A.5)). Then $d^1(t^i_{Z^n}, t_{\theta^*}) \nrightarrow 0$, so also $d^{UW}_i(t^i_{Z^n}, t_{\theta^*}) \nrightarrow 0$, and it follows from Lemma 6 that agent $i$ does not commonly learn $\theta^*$.


A.3.3 Proof of Claim 3

I prove this claim in two parts. Recall that $U^{NE}_a$ is the set of all complete information games in which $a$ is a Nash equilibrium. Thus, the set $h^{-1}\big(U^{NE}_a\big)$ is the set of all distributions over $\Theta$ that induce an expected payoff in $U^{NE}_a$. The first claim says that $\delta_{\theta^*} \in \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$ if and only if $h(F_{Z^n})$ is almost surely contained in $U^{NE}_a$ as the quantity of data $n$ tends to infinity.

Claim 8 $\lim_{n \to \infty} h(F_{Z^n}) \subseteq U^{NE}_a$ a.s. if and only if $\delta_{\theta^*} \in \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$.

Proof 14 Sufficiency. Suppose $\delta_{\theta^*} \in \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$. Recall that under uniform consistency, $W_1(F_{Z^n}, \delta_{\theta^*}) \to 0$ a.s., so that
$$\lim_{n \to \infty} F_{Z^n} \subseteq V \quad \text{a.s.}$$
for every open set $V$ with $\delta_{\theta^*} \in V$. This implies in particular that
$$\lim_{n \to \infty} F_{Z^n} \subseteq h^{-1}\big(U^{NE}_a\big) \quad \text{a.s.}$$
Using continuity of $h$ (see Lemma 4), it follows from the continuous mapping theorem that
$$\lim_{n \to \infty} h(F_{Z^n}) \subseteq U^{NE}_a \quad \text{a.s.}$$
as desired.

Necessity. Suppose $\delta_{\theta^*} \notin \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$. Under assumption NI, there exists a constant $\delta > 0$ independent of $n$, and a set $Z^*_n$ of measure $\delta$, such that
$$\delta_{\theta^*} \in \mathrm{Int}(F_{z^n}) \qquad \forall\, z^n \in Z^*_n.$$
Consider any dataset $z^n \in Z^*_n$. Since $\delta_{\theta^*} \notin \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$, necessarily $F_{z^n} \not\subseteq h^{-1}\big(U^{NE}_a\big)$. It follows that
$$\lim_{n \to \infty} P^n\Big(\big\{ z^n : h(F_{z^n}) \subseteq U^{NE}_a \big\}\Big) < 1$$
as desired.

Claim 9 $\delta_{\theta^*} \in \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$ if and only if $u^* \in \mathrm{Int}\big(U^{NE}_a\big)$.

Proof 15 Suppose $u^* \in \mathrm{Int}\big(U^{NE}_a\big)$. Then, there is an open set $V$ such that
$$u^* \in V \subseteq U^{NE}_a.$$
Since $h$ is continuous (see Lemma 4), $h^{-1}(V)$ is an open set in $\Delta(\Theta)$. So
$$\delta_{\theta^*} \in h^{-1}(V) \subseteq h^{-1}\big(U^{NE}_a\big),$$
implying that $\delta_{\theta^*} \in \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$, as desired.

For the other direction, suppose towards contradiction that $\delta_{\theta^*} \in \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$ but $u^* \notin \mathrm{Int}\big(U^{NE}_a\big)$. Since $u^*$ is on the boundary of $U^{NE}_a$, there exists some agent $i$ and action $a'_i \ne a_i$ such that
$$u^*_i(a'_i, a_{-i}) \ge u^*_i(a_i, a_{-i}).$$
Under assumption 4, $g(\Theta)$ has nonempty intersection with $S(i, a_i)$, so there exists some $\theta \in g^{-1}(S(i, a_i))$. For every $\epsilon > 0$, define
$$\mu_\epsilon = (1 - \epsilon)\, \delta_{\theta^*} + \epsilon\, \delta_\theta.$$
The expected payoff under $\mu_\epsilon$ satisfies
$$\int_{\mathcal{U}} u_i(a'_i, a_{-i})\, dg^*(\mu_\epsilon) > \int_{\mathcal{U}} u_i(a_i, a_{-i})\, dg^*(\mu_\epsilon),$$
where $g^*(\nu)$ denotes the pushforward measure of $\nu \in \Delta(\Theta)$ under the map $g$. So $a_i$ is not a best response to $a_{-i}$ given beliefs $\mu_\epsilon$ over $\Theta$, and therefore $h(\mu_\epsilon) \notin U^{NE}_a$. Thus the sequence $\mu_\epsilon \to \delta_{\theta^*}$ has the property that $\mu_\epsilon \notin h^{-1}\big(U^{NE}_a\big)$ for every $\epsilon$, so $\delta_{\theta^*} \notin \mathrm{Int}\big(h^{-1}\big(U^{NE}_a\big)\big)$, as desired.


A.3.4 Proof of Theorem 2

Only if: Define $U^R_{a^*_i} \subseteq \mathcal{U}$ to consist of all payoffs $u$ such that $a^*_i$ is rationalizable for player $i$ in the complete information game with payoffs $u$.

Lemma 8 $u \in \mathrm{Int}\big(U^R_{a^*_i}\big)$ if and only if $a^*_i$ survives every round of weak strict-rationalizability in the complete information game with payoffs $u$.

Proof 16 Only if: Suppose $a^*_i$ fails to survive some iteration of weak strict-rationalizability. Then, there exists a sequence of sets $\big(W^k_j\big)_{k \ge 1}$ for every agent $j$ satisfying the recursive description in Section 5.1, such that $a^*_i \notin W^K_i$ for some $K < \infty$. To show that $u \notin \mathrm{Int}\big(U^R_{a^*_i}\big)$, I construct a sequence of payoff functions $u^n$ with $u^n \to u$ (in the sup-metric) such that $a^*_i$ is not rationalizable in any complete information game with payoffs along this sequence, for $n$ sufficiently large.

For every $n \ge 1$, define the payoff function $u^n$ as follows. For every agent $j$, let $u^{n,1}_j$ satisfy
$$\begin{aligned}
u^{n,1}_j(a_j, a_{-j}) &= u_j(a_j, a_{-j}) + \epsilon/n \qquad \forall\, a_j \in W^1_j \text{ and } \forall\, a_{-j} \in A_{-j}, \\
u^{n,1}_j(a_j, a_{-j}) &= u_j(a_j, a_{-j}) \qquad \text{otherwise}.
\end{aligned}$$
Recursively for $k > 1$, let $u^{n,k}_j$ satisfy
$$\begin{aligned}
u^{n,k}_j(a_j, a_{-j}) &= u^{n,k-1}_j(a_j, a_{-j}) + \epsilon/n \qquad \forall\, a_j \in W^k_j \text{ and } \forall\, a_{-j} \in A_{-j}, \\
u^{n,k}_j(a_j, a_{-j}) &= u^{n,k-1}_j(a_j, a_{-j}) \qquad \text{otherwise}.
\end{aligned}$$
Define $u^n$ such that $u^n_j := u^{n,K}_j$ for every player $j$.

I claim that $a^*_i$ is not rationalizable in the complete information game with payoff function $u^n$, for any $n$ sufficiently large. To show this, let us construct for every player $j$ the sets $\big(S^{k,n}_j\big)_{k \ge 1}$ of actions surviving $k$ rounds of iterated elimination of strictly dominated strategies given payoff function $u^n$, and show that for $n$ sufficiently large, $S^{k,n}_j = W^k_j$ for all $k$ and every player $j$. I will use the following intermediate results.

Claim 10 There exists $\gamma > 0$ such that for any $u'$ satisfying $\|u' - u\|_\infty < \gamma$, and for any agent $j$, if
$$u_j(a_j, a_{-j}) > \max_{a'_j \ne a_j} u_j(a'_j, a_{-j})$$
then
$$u'_j(a_j, a_{-j}) > \max_{a'_j \ne a_j} u'_j(a'_j, a_{-j}).$$

Proof 17 Let $\gamma = \frac{1}{2} \min_{i \in I} \min_{a_i \in A_i} \left| u_i(a_i, a_{-i}) - \max_{a'_i \ne a_i} u_i(a'_i, a_{-i}) \right|$, which exists by finiteness of $I$ and action sets $A_i$. The claim follows immediately.

Corollary 5 Let $N = \epsilon K / \gamma$. Then, for every $n \ge N$, if
$$u_j(a_j, a_{-j}) > \max_{a'_j \ne a_j} u_j(a'_j, a_{-j})$$
then
$$u^{n,k}_j(a_j, a_{-j}) > \max_{a'_j \ne a_j} u^{n,k}_j(a'_j, a_{-j})$$
for every $k \ge 1$.

Proof 18 Directly follows from Claim 10, since for every $j$,
$$\|u^{n,k}_j - u_j\|_\infty \le \|u^n_j - u_j\|_\infty \le \frac{\epsilon K}{n}$$
by construction.

The remainder of the proof proceeds by induction. Trivially, $S^{0,n}_j = W^0_j = A_j$ for every $j$ and $n$. Now consider any agent $j$ and action $a_j \in A_j$. Suppose there exists some strategy $\alpha_{-j} \in \Delta(A_{-j})$ such that
$$u_j(a_j, \alpha_{-j}) - \max_{a'_j \ne a_j} u_j(a'_j, \alpha_{-j}) > 0,$$
so that $a_j$ is a strict best response to $\alpha_{-j}$ under $u$. Then $a_j \in W^1_j$, and for $n \ge N$, also $a_j \in S^{1,n}_j$ (using Corollary 5). Suppose $a_j$ is never a strict best response, but there exists $\alpha_{-j} \in \Delta(A_{-j})$ such that
$$u_j(a_j, \alpha_{-j}) - \max_{a'_j \ne a_j} u_j(a'_j, \alpha_{-j}) = 0.$$
If $a_j \in W^1_j$, then
$$u^n_j(a_j, \alpha_{-j}) - \max_{a'_j \ne a_j} u^n_j(a'_j, \alpha_{-j}) \ge u_j(a_j, \alpha_{-j}) - \max_{a'_j \ne a_j} u_j(a'_j, \alpha_{-j}),$$
so also $a_j \in S^{1,n}_j$ for $n \ge N$. If $a_j \notin W^1_j$, then for $n \ge N$, there exists an action $a'_j \ne a_j$ such that $u_j(a'_j, \alpha_{-j}) = u_j(a_j, \alpha_{-j})$, but $u^n_j(a'_j, \alpha_{-j}) > u^n_j(a_j, \alpha_{-j})$. So $a_j \notin S^{1,n}_j$. No other actions survive to either $W^1_j$ or $S^{1,n}_j$. Thus $S^{1,n}_j = W^1_j$ for all $n \ge N$.

This argument can be repeated for arbitrary $k$. Suppose $S^{k,n}_j = W^k_j$ for every $j$ and $n \ge N$, and consider any action $a_j \in S^{k,n}_j$. If there exists some strategy $\alpha_{-j} \in \Delta(S^{k,n}_{-j})$ such that
$$u_j(a_j, \alpha_{-j}) - \max_{a'_j \ne a_j} u_j(a'_j, \alpha_{-j}) > 0,$$
then $a_j \in W^{k+1}_j$, and for $n \ge N$, also $a_j \in S^{k+1,n}_j$ (using Corollary 5). Suppose $a_j$ is not a strict best response to any $\alpha_{-j} \in \Delta(S^{k,n}_{-j})$, but there exists $\alpha_{-j} \in \Delta(S^{k,n}_{-j})$ such that
$$u_j(a_j, \alpha_{-j}) - \max_{a'_j \ne a_j} u_j(a'_j, \alpha_{-j}) = 0.$$
Then, if $a_j \in W^{k+1}_j$, action $a_j$ is a strict best response to $\alpha_{-j}$ under $u^n$, so $a_j \in S^{k+1,n}_j$. Otherwise, if $a_j \notin W^{k+1}_j$, then there exists some $a'_j \in W^{k+1}_j$ such that $u^n_j(a'_j, \alpha_{-j}) > u^n_j(a_j, \alpha_{-j})$, so also $a_j \notin S^{k+1,n}_j$. No other actions survive to either $W^{k+1}_j$ or $S^{k+1,n}_j$, so $S^{k+1,n}_j = W^{k+1}_j$ for $n \ge N$. Therefore $S^{k,n}_j = W^k_j$ for every $k$ and $n \ge N$, and in particular $S^{K,n}_j = W^K_j$ for $n \ge N$. Since $a^*_i \notin W^K_i$, also $a^*_i \notin S^{\infty,n}_i$ for $n$ sufficiently large, as desired.


Finally, notice that by construction $\|u^n - u\|_\infty \le \frac{\epsilon K}{n}$, which can be rewritten
$$\|u^{n(\epsilon')} - u\|_\infty \le \epsilon'$$
where $n(\epsilon') := \epsilon K / \epsilon'$. Thus, for every $\epsilon' > 0$, the payoff function $u^{n(\epsilon')} \in B_{\epsilon'}(u)$, but $a^*_i$ is not rationalizable in the complete information game with payoff function $u^{n(\epsilon')}$. So $u \notin \mathrm{Int}\big(U^R_{a^*_i}\big)$, as desired.

If: Suppose $u \notin \mathrm{Int}\big(U^R_{a^*_i}\big)$. Then there exists a sequence of payoff functions $u^n \to u$ such that $a^*_i$ is not rationalizable in the complete information game with payoffs $u^n$. Since action sets are finite, there is a finite number of possible orders of elimination. This implies existence of a subsequence along which the same order of iterated elimination of strategies removes $a^*_i$. Choose any one-at-a-time iteration of this order of elimination. Then, $a^*_i$ fails to survive this order of elimination given the limiting payoffs $u$, so it is not weakly strict-rationalizable.
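The proof compares the sets $S^{k,n}_j$ of actions surviving iterated elimination under the perturbed payoffs with the sets $W^k_j$. For concreteness, here is a minimal two-player sketch of iterated elimination, using the coarser test of strict domination by pure actions only (the text works with strict best responses against mixed strategies); the function names and the example game are mine, not the text's.

```python
from itertools import product

def iterated_strict_dominance(U1, U2):
    # U1[i][j], U2[i][j]: payoffs of the row and column player.
    # Repeatedly delete any action strictly dominated by another
    # surviving PURE action (coarser than mixed-strategy domination).
    rows, cols = set(range(len(U1))), set(range(len(U1[0])))
    changed = True
    while changed:
        changed = False
        for a, b in product(list(rows), repeat=2):
            if a in rows and b in rows and a != b and all(U1[b][j] > U1[a][j] for j in cols):
                rows.discard(a); changed = True
        for a, b in product(list(cols), repeat=2):
            if a in cols and b in cols and a != b and all(U2[i][b] > U2[i][a] for i in rows):
                cols.discard(a); changed = True
    return sorted(rows), sorted(cols)

# A prisoner's-dilemma-style game: the second action strictly dominates.
U1 = [[3, 0], [5, 1]]
U2 = [[3, 5], [0, 1]]
print(iterated_strict_dominance(U1, U2))   # ([1], [1])
```

Because action sets are finite, the loop reaches a fixed point in finitely many rounds, mirroring the finitely many orders of elimination invoked in the "If" direction above.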

Next, I show that $a^*_i$ is robust to inference only if the true payoff function $u^*$ is in the interior of $U^R_{a^*_i}$.

Lemma 9 $a^*_i$ is robust to inference only if $u^* \in \mathrm{Int}\big(U^R_{a^*_i}\big)$.

Proof 19 The following claim will be useful.

Claim 11 $u^* \in \mathrm{Int}\big(U^R_{a^*_i}\big)$ if and only if $\delta_{\theta^*} \in \mathrm{Int}\big(h^{-1}\big(U^R_{a^*_i}\big)\big)$.

Proof 20 See proof of Claim 8.

Suppose $u^* \notin \mathrm{Int}\big(U^R_{a^*_i}\big)$; then, using Claim 11, also $\delta_{\theta^*} \notin \mathrm{Int}\big(h^{-1}\big(U^R_{a^*_i}\big)\big)$. Under assumption NI, there is a constant $\epsilon > 0$ such that $\delta_{\theta^*} \in \mathrm{Int}(F_{z^n})$ for at least an $\epsilon$-measure of datasets. Consider any such dataset. Then, $\delta_{\theta^*} \notin \mathrm{Int}\big(h^{-1}\big(U^R_{a^*_i}\big)\big)$ implies that $F_{z^n} \not\subseteq h^{-1}\big(U^R_{a^*_i}\big)$. Fix any $\mu \in F_{z^n} \setminus h^{-1}\big(U^R_{a^*_i}\big)$. Then $a^*_i$ is not rationalizable in the complete information game with payoffs $h(\mu)$, so it is also not rationalizable for the type with common certainty in $h(\mu)$.

If: If $a^*_i$ is strongly strict-rationalizable, then there exists a family of sets $(V_j)_{j \in I}$ that is closed under $\delta$-strict best reply for some $\delta > 0$; that is, for every $a_j \in V_j$, there exists a distribution $\alpha_{-j} \in \Delta(V_{-j})$ such that
$$u^*_j(a_j, \alpha_{-j}) > \max_{a'_j \ne a_j} u^*_j(a'_j, \alpha_{-j}) + \delta.$$
Recall the following fixed-point property of the set of rationalizable actions:

Lemma 10 ([ICR]) Fix any type profile $(t_j)_{j \in I}$. Consider any family of sets $V_j \subseteq A_j$ such that every action $a_j \in V_j$ is a best reply to a distribution $\pi \in \Delta(\Theta \times T_{-j} \times A_{-j})$ that satisfies $\mathrm{marg}_{\Theta \times T_{-j}}\, \pi = \kappa_j(t_j)$ and $\pi\big(a_{-j} \in V_{-j}[t_{-j}]\big) = 1$. Then, $V_j \subseteq S^\infty_j[t_j]$ for every agent $j$.

Fix any $\epsilon > 0$. Then, for every agent $j$ and type $t_j$ with common certainty in $B_\epsilon(u^*)$, we have that
$$\int u_j(a_j, \alpha_{-j}, \theta)\, d\kappa_j(t_j) - \max_{a'_j \ne a_j} \int u_j(a'_j, \alpha_{-j}, \theta)\, d\kappa_j(t_j) \ge \inf_{u \in B_\epsilon(u^*)} \left( u_j(a_j, \alpha_{-j}) - \max_{a'_j \ne a_j} u_j(a'_j, \alpha_{-j}) \right) \ge \delta - 2\epsilon,$$
which is positive for any $\epsilon < \delta/2$. So the family of sets $(V_j)_{j \in I}$ satisfies the conditions in Lemma 10 when $\epsilon$ is sufficiently small, and it follows that $a^*_i \in S^\infty_i[t_i]$, as desired.

A.3.5 Proof of Proposition 2

To simplify notation, set $\delta := \delta^{NE}_{a^*}$. By assumption, $\delta > 0$.

Lemma 11 $B_{\delta/2}(u^*) \subseteq U^{NE}_{a^*}$.

Proof 21 Consider any payoff function $u'$ satisfying
$$\|u' - u^*\|_\infty \le \frac{\delta}{2}. \tag{A.6}$$
Then for every agent $i$,
$$u'_i(a^*_i, a^*_{-i}) - u'_i(a'_i, a^*_{-i}) = \underbrace{u'_i(a^*_i, a^*_{-i}) - u^*_i(a^*_i, a^*_{-i})}_{\ge -\delta/2} + \underbrace{u^*_i(a^*_i, a^*_{-i}) - u^*_i(a'_i, a^*_{-i})}_{> \delta} + \underbrace{u^*_i(a'_i, a^*_{-i}) - u'_i(a'_i, a^*_{-i})}_{\ge -\delta/2} \ge 0,$$
where $u^*_i(a^*_i, a^*_{-i}) - u^*_i(a'_i, a^*_{-i}) > \delta$ follows from the assumption that $a^*$ is a $\delta$-strict NE in the complete information game with payoffs $u^*$, and the other two bounds follow from (A.6). So $a^*$ is a NE in the complete information game with payoffs $u'$, implying that $u' \in U^{NE}_{a^*}$.

It follows from Lemma 2 that common certainty in $B_{\delta/2}(u^*)$ is a sufficient condition for $a^*$ to be a Bayesian Nash equilibrium. Thus,
$$\begin{aligned}
p^{NE}_n(a^*) &\ge P^n\big(\{ z^n : h(F_{z^n}) \subseteq B_{\delta/2}(u^*) \}\big) \\
&= P^n\left(\left\{ z^n : \sup_{\mu \in M} \|h(\mu_{z^n}) - u^*\|_\infty \le \delta/2 \right\}\right) \\
&= 1 - P^n\left(\left\{ z^n : \sup_{\mu \in M} \|h(\mu_{z^n}) - u^*\|_\infty > \delta/2 \right\}\right) \\
&\ge 1 - \frac{2}{\delta}\, \mathbb{E}_{P^n}\left( \sup_{\mu \in M} \|h(\mu_{Z^n}) - u^*\|_\infty \right)
\end{aligned}$$
using Markov's inequality in the final line.
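The last step is ordinary Markov's inequality, $P(Y > t) \le \mathbb{E}[Y]/t$ for a nonnegative random variable $Y$, applied with $Y = \sup_{\mu \in M} \|h(\mu_{Z^n}) - u^*\|_\infty$ and $t = \delta/2$. A generic numerical sanity check of that step (the half-normal distribution here is purely illustrative, not the distribution from the model):

```python
import random

random.seed(0)
# Empirical check of Markov's inequality P(Y > t) <= E[Y] / t for a
# nonnegative Y; the choice Y = |N(0,1)| is illustrative only.
draws = [abs(random.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(draws) / len(draws)
t = 1.5
tail_prob = sum(d > t for d in draws) / len(draws)
assert tail_prob <= mean / t
```

As usual with distribution-free tail bounds, the inequality is loose; the same slack appears in the probability bound of the proposition.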

A.3.6 Proof of Proposition 3

To simplify notation, set $\delta := \delta^R_{a^*_i}$. By assumption, $\delta > 0$.

Lemma 12 $B_{\delta/2}(u^*) \subseteq U^R_{a^*_i}$.

Proof 22 Consider any payoff function $u'$ satisfying
$$\|u' - u^*\|_\infty \le \frac{\delta}{2}. \tag{A.7}$$
By definition of $\delta^R_{a^*_i}$, there exists a family of sets $(R_j)_{j \in I}$ with the property that for every agent $j$ and action $a_j \in R_j$, there is a distribution $\alpha_{-j}[a_j] \in \Delta(R_{-j})$ satisfying
$$u^*_j(a_j, \alpha_{-j}[a_j]) > u^*_j(a'_j, \alpha_{-j}[a_j]) + \delta \qquad \forall\, a'_j \ne a_j. \tag{A.8}$$
I will show that $(R_j)_{j \in I}$ satisfies the conditions in Lemma 10 for any type profile $(t_j)_{j \in I}$, where every $t_j$ has common certainty in $B_{\delta/2}(u^*)$. Fix an arbitrary agent $j$, and type $t_j$ with common certainty in $B_{\delta/2}(u^*)$. Define the distribution $\pi \in \Delta(\Theta \times T_{-j} \times A_{-j})$ such that $\mathrm{marg}_{\Theta \times T_{-j}}\, \pi = \kappa_j(t_j)$ and $\mathrm{marg}_{A_{-j}}\, \pi = \alpha_{-j}[a_j]$, noting that since $\alpha_{-j}[a_j] \in \Delta(R_{-j})$, this implies also that $\pi(a_{-j} \in R_{-j}) = 1$.

Since by assumption, $t_j$ has common certainty in $B_{\delta/2}(u^*)$, the support of $\mathrm{marg}_\Theta\, \kappa_j(t_j)$ is contained in $B_{\delta/2}(u^*)$. So the expected payoff from playing $a_j$ exceeds the expected payoff from playing $a'_j \ne a_j$ by at least
$$\inf_{u \in B_{\delta/2}(u^*)} \left( u(a_j, \alpha_{-j}[a_j]) - u(a'_j, \alpha_{-j}[a_j]) \right) \ge -\frac{\delta}{2}. \tag{A.9}$$
It follows that
$$\begin{aligned}
\int u_j(a_j, a_{-j}, \theta)\, d\pi - \int u_j(a'_j, a_{-j}, \theta)\, d\pi
&= \underbrace{\int u_j(a_j, a_{-j}, \theta)\, d\pi - u^*_j(a_j, \alpha_{-j}[a_j])}_{\ge -\frac{1}{2}\delta}
+ \underbrace{u^*_j(a_j, \alpha_{-j}[a_j]) - u^*_j(a'_j, \alpha_{-j}[a_j])}_{> \delta} \\
&\quad + \underbrace{u^*_j(a'_j, \alpha_{-j}[a_j]) - \int u_j(a'_j, a_{-j}, \theta)\, d\pi}_{\ge -\frac{1}{2}\delta}
\ \ge\ 0,
\end{aligned}$$
using the inequalities in (A.8) and (A.9). It follows that $a_j$ is a best response to $\alpha_{-j}[a_j]$ given distribution $\pi$. Repeating this argument for every agent $j$, action $a_j \in R_j$, and type $t_j$ with common certainty in $B_{\delta/2}(u^*)$, it follows from Lemma 10 that $R_j \subseteq S^\infty_j[t_j]$ for every agent $j$. Since $a^*_i \in R_i$, also $a^*_i \in S^\infty_i[t_i]$, as desired.

It follows from this lemma that $h(F_z) \subseteq B_{\delta/2}(u^*)$ is a sufficient condition for $a^*_i$ to be rationalizable in every game in $G(z)$. Thus,
$$\begin{aligned}
p^R_n(i, a^*_i) &\ge P^n\big(\{ z^n : h(F_{z^n}) \subseteq B_{\delta/2}(u^*) \}\big) \\
&= P^n\left(\left\{ z^n : \sup_{\mu \in M} \|h(\mu_{z^n}) - u^*\|_\infty \le \delta/2 \right\}\right) \\
&= 1 - P^n\left(\left\{ z^n : \sup_{\mu \in M} \|h(\mu_{z^n}) - u^*\|_\infty > \delta/2 \right\}\right) \\
&\ge 1 - \frac{2}{\delta}\, \mathbb{E}_{P^n}\left( \sup_{\mu \in M} \|h(\mu_{Z^n}) - u^*\|_\infty \right)
\end{aligned}$$
using Markov's inequality in the final line.

Proof of Corollary 4

From properties of the least-squares estimator,
$$\begin{aligned}
\mathbb{E}\big(|\hat{\beta}_1 - \beta_1|^2\big) = \mathrm{Var}(\hat{\beta}_1) &\le \sum_j \mathrm{Var}(\hat{\beta}_j) \\
&= \sigma^2 \sum_k \mathbb{E}\Big[\big(X^T X\big)^{-1}\Big]_{kk} \\
&= \sigma^2\, \mathbb{E}\, \mathrm{tr}\Big(\big(X^T X\big)^{-1}\Big) \\
&= \sigma^2\, \mathbb{E}\left( \sum_i \lambda_i^{-1} \right) \\
&\le \sigma^2\, \frac{p\,(\sqrt{n} + \sqrt{p})^2}{(n - p)^2},
\end{aligned}$$
where the final line follows from Gordon's theorem for Gaussian matrices (see e.g. [matrices]). Let $K$ be the Lipschitz constant of the map $g : \Theta \to \mathcal{U}$ (assuming the sup-norm on $\mathcal{U}$ and the Euclidean norm on $\Theta$); then
$$\mathbb{E}\left( \sup_{\mu \in M} \|h(\mu_{Z^n}) - u^*\|_\infty \right) \le K\, \mathbb{E}\big(|\hat{\beta}_1 - \beta_1|^2 + \phi_n^2\big) \le K\left( \sigma^2\, \frac{p\,(\sqrt{n} + \sqrt{p})^2}{(n - p)^2} + \phi_n^2 \right)$$
and the desired bound follows directly from Proposition 2.

A.3.7 Proof of Proposition 4

The argument below is for Nash equilibrium; the argument for rationalizability follows analogously. For every inference rule $\mu \in M$, define
$$X^n_\mu = \mathbf{1}\left\{ h(\mu_{Z^n}) \notin U^{NE}_a \right\}$$
to take value 1 if the expected payoff under the (random) distribution $\mu(Z^n)$ is outside the set $U^{NE}_a$. Write $F^n_\mu$ for the marginal distribution of the random variable $X^n_\mu$, and $F^n_M$ for the joint distribution of the random variables $(X^n_\mu)_{\mu \in M}$. Enumerate the inference rules in $M$ by $\mu_1, \dots, \mu_K$.

By Sklar's theorem, there exists a copula $C : [0, 1]^K \to [0, 1]$ such that
$$F^n_M(x_1, \dots, x_K) = C\big(F^n_{\mu_1}(x_1), \dots, F^n_{\mu_K}(x_K)\big)$$
for every $x_1, \dots, x_K$. Using the Fréchet-Hoeffding bound,
$$1 - K + \sum_{k=1}^K F^n_{\mu_k}(x_k) \le C\big(F^n_{\mu_1}(x_1), \dots, F^n_{\mu_K}(x_K)\big) \le \min_{k \in \{1, \dots, K\}} F^n_{\mu_k}(x_k).$$
From Lemma 2, $p^{NE}_n(a) = F^n_M(0, \dots, 0)$. It follows that
$$1 - K + \sum_{k=1}^K F^n_{\mu_k}(0) \le p^{NE}_n(a) \le \min_{k \in \{1, \dots, K\}} F^n_{\mu_k}(0). \tag{A.10}$$
Finally, since every $X^n_\mu \sim \mathrm{Ber}\big(1 - p^{NE}_{\mu,n}\big)$, (A.10) implies
$$1 - \sum_{\mu \in M} \big(1 - p^{NE}_{\mu,n}\big) \le p^{NE}_n(a) \le \min_{\mu \in M} p^{NE}_{\mu,n}$$
as desired.
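The Fréchet-Hoeffding step can be checked exactly with a two-line computation: when the indicators $X_k = \mathbf{1}\{U > p_k\}$ are driven by a single shared uniform $U$ (the comonotone coupling), $P(X_1 = \dots = X_K = 0)$ attains the upper bound $\min_k p_k$, while independent draws give the product of the $p_k$, which also lies between the bounds. The probability values below are illustrative, not taken from the model.

```python
# Exact check of the Frechet-Hoeffding bounds behind (A.10).
# p[k] plays the role of F^n_{mu_k}(0); the values are illustrative.
p = [0.95, 0.90, 0.85]
lower = 1 - len(p) + sum(p)      # 1 - K + sum_k F_k(0)
upper = min(p)                   # min_k F_k(0)

joint_comonotone = min(p)        # one shared uniform drives every indicator
joint_independent = 1.0
for pk in p:
    joint_independent *= pk      # independent inference rules

for joint in (joint_comonotone, joint_independent):
    assert lower - 1e-12 <= joint <= upper + 1e-12
```

This is exactly why the proposition's bounds cannot be improved without knowledge of the dependence between the inference rules: both extremes are achievable joint distributions.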

A.4 Appendix D: An example illustrating the fragility of weak strict-rationalizability

In the following, I present a game in which an action is weakly strict-rationalizable, but fails to be rationalizable along a sequence of perturbed types in the uniform-weak topology.

Consider a game with four players. Each has two actions, a and b. Throughout I will use, for example, abab to denote choice of a by players 1 and 3, and b by players 2 and 4. Let payoffs be defined as follows. Player 1's payoffs satisfy

$$u_1(a\,xxx) = \begin{cases} 1 & \text{if } xxx = aaa \text{ or } bbb \\ 0 & \text{otherwise,} \end{cases} \qquad u_1(b\,xxx) = \begin{cases} 0 & \text{if } xxx = aaa \text{ or } bbb \\ 1 & \text{otherwise.} \end{cases}$$

That is, player 1 wants to play a if players 2-4 are all playing a or all playing b, and he wants to play b otherwise. The payoffs to players 2-4 are independent of player 1's action. They are described below (where rows correspond to player 2's actions, columns to player 3's actions, and the choice of matrix to player 4's action), with player 1's payoffs omitted, so that the first coordinate corresponds to player 2's payoff:


           a          b                       a          b
    a   1, 1, 0    0, 0, 0             a   0, 0, 0    0, 0, 0
    b   0, 0, 0    0, 0, 0             b   0, 0, 0    1, 1, 0
             (a)                                (b)                (A.11)

That is, if player 4 chooses action a, then players 2 and 3 prefer coordination on a;

and if player 4 chooses b, then players 2 and 3 prefer coordination on b.
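Player 1's payoff rule and the resulting best responses can be encoded directly; the sketch below (all function and variable names are mine) verifies the claim that a is player 1's strict best response exactly when players 2-4 coordinate:

```python
from itertools import product

# Player 1's payoff as defined above: a pays 1 iff players 2-4 all match.
def u1(a1, others):
    coordinated = others in (("a", "a", "a"), ("b", "b", "b"))
    if a1 == "a":
        return 1 if coordinated else 0
    return 0 if coordinated else 1

# Check player 1's best response against every pure profile of players 2-4.
for others in product("ab", repeat=3):
    best = "a" if u1("a", others) > u1("b", others) else "b"
    expected = "a" if others in (("a", "a", "a"), ("b", "b", "b")) else "b"
    assert best == expected
```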

Let us first consider the case in which the true payoffs are common certainty, so

that this is a game of complete information (denote the payoffs by u). Then, a is

rationalizable for player 1. Not only is it rationalizable, but:

• there is a constant e > 0 such that a is rationalizable for player 1 in every

game u0 with ku0 � uk• e; that is, rationalizability is preserved on an open

set of complete information games.

• a is weakly strict-rationalizable.

• although a is not strongly strict-rationalizable, it fails to survive this process

for the reason that none of player 4’s actions survive the first round of

elimination.4

Let $t_1$ be the type with common certainty in $u$. I will now show that there exists a sequence of types $t_1^n$ such that $t_1^n \to t_1$ in the uniform-weak topology, but a fails to be rationalizable for agent 1 infinitely many times along this sequence. The sequence of types $t_1^n$ will moreover have the property that every $t_1^n$ believes that an $\varepsilon_n$-neighborhood of $u$ is common certainty, where $\varepsilon_n \to 0$ as $n \to \infty$.

4In particular, a is strongly strict-rationalizable in either game in which one of player 4's actions is dropped.


Define $t_1^n$ to satisfy two conditions. First, player 1 is certain5 that: player 2 is certain that the payoffs in (A.11) are

            a               b                        a                b
   a    1, 1, −ε_n     0, 0, −ε_n         a    −ε_n, −ε_n, 0    0, −ε_n, 0
   b    0, 0, −ε_n     0, 0, −ε_n         b       0, 0, 0        1, 1, 0

          (a)                                   (b)                    (A.12)

and player 2 is certain, moreover, that player 4 is certain of these payoffs. Second,

player 1 is certain that: player 3 is certain that the payoffs in (A.11) are

            a               b                        a                b
   a     1, 1, 0        0, 0, 0           a     0, 0, −ε_n      0, 0, −ε_n
   b  −ε_n, −ε_n, 0  −ε_n, −ε_n, 0        b     0, 0, −ε_n      1, 1, −ε_n

          (a)                                   (b)                    (A.13)

and player 3 is certain, moreover, that player 4 is certain of these payoffs.

Let us now consider the rationalizable actions for players 2 and 3. If player 4

is certain that payoffs are as in (A.12), then action b is his uniquely rationalizable

action. So player 2, with the beliefs described above, believes with probability 1

that player 4 will play b. Since he is himself certain of the payoffs in (A.12), action

b is his uniquely rationalizable action. By a similar argument, if player 4 is certain

that payoffs are as in (A.13), then action a is uniquely rationalizable. So player 3,

with the beliefs described above, believes with probability 1 that player 4 will play

a, and thus considers a to be his uniquely rationalizable action as well.

So player 1 is certain that player 2 will play b and that player 3 will play a. It

follows that his uniquely rationalizable action is b. Since this argument is valid for

5Believes with probability 1.


every $\varepsilon_n > 0$, action a is not rationalizable for player 1 of type $t_1^n$ for any $n$. But every $t_1^n$ believes that $B_{\varepsilon_n}(u)$ is common certainty, so $t_1^n \to t_1$ in the uniform-weak topology.
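The elimination argument above can be checked mechanically. The sketch below computes, for players 2-4 under the perturbed payoffs (A.12), the actions that survive iterated removal of never-best-responses against point beliefs over surviving pure profiles (a simplification of full rationalizability that suffices here, since each elimination step is by strict dominance). The value of $\varepsilon_n$ is fixed at an arbitrary positive number:

```python
from itertools import product

EPS = 0.1  # stands in for some eps_n > 0; any positive value gives the same answer

# Payoffs for players 2, 3, 4 in the perturbed game (A.12).
# Keys are (action_2, action_3, action_4); values are (u2, u3, u4).
U = {
    ('a', 'a', 'a'): (1, 1, -EPS), ('a', 'b', 'a'): (0, 0, -EPS),
    ('b', 'a', 'a'): (0, 0, -EPS), ('b', 'b', 'a'): (0, 0, -EPS),
    ('a', 'a', 'b'): (-EPS, -EPS, 0), ('a', 'b', 'b'): (0, -EPS, 0),
    ('b', 'a', 'b'): (0, 0, 0),       ('b', 'b', 'b'): (1, 1, 0),
}

def iterated_elimination(U):
    """Iteratively delete any action that is not a best response to some
    profile of surviving opponent actions (point beliefs only)."""
    surviving = [{'a', 'b'} for _ in range(3)]
    changed = True
    while changed:
        changed = False
        for i in range(3):
            keep = set()
            others = [sorted(surviving[j]) for j in range(3) if j != i]
            for opp in product(*others):
                best, best_val = set(), None
                for ai in sorted(surviving[i]):
                    profile = list(opp)
                    profile.insert(i, ai)
                    u = U[tuple(profile)][i]
                    if best_val is None or u > best_val:
                        best, best_val = {ai}, u
                    elif u == best_val:
                        best.add(ai)
                keep |= best
            if keep != surviving[i]:
                surviving[i] = keep
                changed = True
    return surviving

print(iterated_elimination(U))  # → [{'b'}, {'b'}, {'b'}]: players 2-4 all play b
```

Action b is strictly dominant for player 4, after which b becomes uniquely optimal for players 2 and 3 in turn, matching the text's conclusion that player 2 plays b under these beliefs.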


Appendix B

Appendix to Chapter 2

B.1 Experiment Instructions

Subjects were presented with an introduction screen [screenshot omitted from this transcript]. Following a trial round and provision of consent, subjects were presented with 50 identical screens [screenshot omitted]. Subjects were given 30 seconds to complete each string, and a timer displayed their remaining time.

B.2 Behavioral Prediction Rules

Rabin prediction rules. Define continuation rule
\[
f_R(s_{1:7}) = p(0.5) + \sum_{k=0}^{6} p(1-p)^k \, \frac{0.5N - \sum_{j=7-k}^{7} s_j}{N}
\]
and classification rule
\[
c_R(s) = \sum_{r \in \{0,1\}^8} p^{\sum_i r_i} (1-p)^{8 - \sum_i r_i} \, q(s \mid r)
\]


where
\[
q(s \mid r) = 0.5\,r_k + (1 - r_k)\left( \frac{0.5N - \sum_{j=1}^{\min\{j \,:\, r_{k-j}=1\}} r_{k-j}\,\mathbf{1}(s_{k-j} = s_k)}{N} \right)
\]
is the probability that string $s$ is generated when the urn is refreshed at every '1' in $r$. There are two free parameters: $p \in [0,1]$ and $N \in \mathbb{N}$.

Gambler prediction rules. Define prediction rule
\[
f_{RV}(s_{1:7}) = 0.5 - a \sum_{k=0}^{6} \delta^k (2 s_{7-k} - 1).
\]
Define classification rule
\[
c_{RV}(s) = \sum_k \left[ s_k \left( 0.5 - a \sum_{j \le k} \delta^{k-j} g(s_j) \right) + (1 - s_k) \left( 0.5 + a \sum_{j \le k} \delta^{k-j} g(s_j) \right) \right].
\]
There are two free parameters: $\delta \in [0,1]$ and $a \in \mathbb{R}_+$.
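A direct transcription of the gambler's-fallacy continuation rule, as an illustrative sketch (the parameter values below are arbitrary, not estimates from the chapter):

```python
def f_RV(s, a=0.2, delta=0.5):
    """Gambler's-fallacy continuation rule: shades the predicted probability
    of a 1 away from recent outcomes, with geometric decay delta over the
    seven observed outcomes s = (s_1, ..., s_7)."""
    assert len(s) == 7 and all(x in (0, 1) for x in s)
    return 0.5 - a * sum(delta**k * (2 * s[6 - k] - 1) for k in range(7))

print(f_RV([1, 1, 1, 1, 1, 1, 1]))  # below 0.5: a streak of 1s makes a 0 feel "due"
print(f_RV([0, 0, 0, 0, 0, 0, 0]))  # above 0.5, by symmetry
```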


Appendix C

Appendix to Chapter 3

C.1 Proof of Theorem 1

C.1.1 Preliminary Notation and Results

I use the following objects and definitions. A hypergraph is a pair $H = (V, E)$, where $V$ is a finite nonempty set, called the set of vertices, and $E$ is a family of distinct subsets of $V$, called the set of edges.1 A $k$-coloring of a hypergraph is a partition of its vertex set $V$ into $k$ color classes such that no edge in $E$ is monochromatic. A hypergraph is $k$-colorable if it admits a $k$-coloring. Finally, $G = (V, E)$ is a complete $k$-partite graph if there is a partition $\{V_i\}_{i=1}^k$ of the vertex set $V$ such that $\{u, v\} \in E$ if and only if $u$ and $v$ lie in different cells of the partition. The set of all hypergraphs on $M$ vertices is denoted $\mathcal{H}$. In the remainder of this proof, I refer to hypergraphs simply as graphs.

These concepts are related to our problem as follows. Enumerate the observations in any dataset $D = \{(x, A) : A \in \mathcal{A}\}$ as $\{(x_i, A_i)\}_{i=1}^M$. These choice observations can be identified with a graph $H = (V, E)$, where $V = \{1, 2, \dots, M\}$ indexes observations, and $E$ consists of every set $T \subseteq V$ such that: (1) the observations in $\{(x_i, A_i) \mid i \in T\}$ are inconsistent2, and (2) no proper subset of $\{(x_i, A_i) \mid i \in T\}$ is inconsistent. I refer to the vertices of $H$ and the observations they represent interchangeably.

1This is a generalization of a graph in which edges may connect more than two vertices.

Claim 12 The following statements are equivalent:

1. H is k-colorable.

2. D is k-rationalizable.

Proof 23 Take each color class to represent consistency with a distinct ordering, and the

equivalence directly follows.
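Claim 12 can be spot-checked by brute force on small datasets: a dataset is $k$-rationalizable exactly when its observations can be partitioned into $k$ classes, each consistent with maximization of a single ordering. The sketch below enumerates all assignments over three hypothetical alternatives; it is exponential in everything and intended only as a check of the definitions:

```python
from itertools import permutations, product

X = ['x1', 'x2', 'x3']  # hypothetical alternatives

def consistent(obs):
    """True if some single ordering of X makes each chosen element maximal in its menu."""
    for r in permutations(X):
        rank = {x: i for i, x in enumerate(r)}  # larger index = better
        if all(rank[x] == max(rank[y] for y in A) for x, A in obs):
            return True
    return False

def k_rationalizable(obs, k):
    """Brute force: try every assignment of observations to k orderings."""
    return any(
        all(consistent([o for o, lab in zip(obs, labels) if lab == c]) for c in range(k))
        for labels in product(range(k), repeat=len(obs))
    )

# A direct violation of IIA needs two orderings, exactly a 2-coloring
# of the (single-edge) inconsistency hypergraph:
D = [('x1', {'x1', 'x2'}), ('x2', {'x1', 'x2'})]
print(k_rationalizable(D, 1), k_rationalizable(D, 2))  # → False True
```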

For any graph $H$, let $f_H$ be the linear interpolation of the points $\{(k, \Delta_{k,H}) : k \in \mathbb{Z}_+\}$, where $\Delta_{k,H}$ is the minimal number of nodes that must be removed from $H$ for $H$ to become $k$-colorable.3 Let $F_H$ be the convex hull of the epigraph of $f_H$ (see Figure C.1), and define $c := \frac{1}{\lambda}$. Then, if there does not exist $k \in \mathbb{N}$ satisfying
\[
\Delta_{k,H} < \Delta_{K,H} + c(K - k),^4 \tag{C.1}
\]
the line $h = \{x \mid (-1, -\lambda) \cdot x = (-1, -\lambda) \cdot (K, \Delta_{K,H})\}$ properly supports $F_H$ at $(K, \Delta_{K,H})$, and any solution $R^*$ to the minimization problem $\mathrm{argmin}_{R \subseteq \mathcal{R}} \, |R| + \lambda E(D, R)$ must satisfy $|R^*| = K$, as desired.

Finally, suppose that p = 0, so that the agent perfectly maximizes his context-

dependent ordering. This identifies a deterministic graph G.

2There does not exist an ordering $r$ such that $x_i$ is $r$-maximal in $A_i$ for every $i \in T$.

3This is equivalent to the definition of $\Delta_k$ used in the main text, through Claim 3.

4In vector notation, $(-1, -\lambda) \cdot (K - k, \Delta_{k,H} - \Delta_{K,H}) \ge 0$.


[Figure C.1 omitted: plot of $f_H$ and its convex hull $F_H$, marking the points $(k, \Delta_{k,H})$ and $(K, \Delta_{K,H})$.]

Figure C.1: Any choice of $\lambda$ for which $(-1, -\lambda)$ is a subgradient of $f_H$ at $(K, \Delta_{K,H})$ will recover $K$. With high probability, the set of vectors $\left\{ \left(-1, -\frac{1}{(p+\delta)M}\right) \;\middle|\; \delta \in \left(0, \frac{d(1-p)^K}{M} - 2p - \beta\right) \right\}$ is a subset of the subdifferential of $f_H$ at $(K, \Delta_{K,H})$.

Claim 13 G includes at least d non-overlapping complete K-partite subgraphs.5

Proof 24 Each subgraph induced by the vertices in a K-violation of IIA is a complete

K-partite graph.

C.1.2 Main Proof

Imperfect maximization using the random choice rule P generates a probability

distribution over H. Denote the random graph with this distribution by H. Fix

d 2⇣

0, d(1�p)K

M � 2p � b⌘

and l = (p + d)M. I will now show that the probability

that no k 2 N satisfies (C.1) is at least 1 � O�

c�M�, from which it will follow that

the probability that Eq. (3.2) recovers K is at least 1 � O�

c�M�.

In the subsequent claims, take S ⇠ Bin(M, p) to be the number of observations

which are imperfectly maximized, and take VE ✓ V to be the random variable

whose outcome is the set of imperfectly maximized observations.

5Two subgraphs are said to be non-overlapping if they do not share vertices.


Lemma 13 The probability that no $k > K$ satisfies (C.1) is at least $1 - e^{-2\delta^2 M}$.

Proof 25 Since $\Delta_{K,H} \le S$, if there exists $k > K$ such that $\Delta_{k,H} < \Delta_{K,H} - c(k - K)$, then necessarily $S > c = (p + \delta)M$; otherwise, $\Delta_{k,H} < c(K - k + 1) \le 0$. Since $E(S) = pM$, it follows from Hoeffding's Inequality that
\[
\Pr(S \ge c) = \Pr(S - pM \ge c - pM) \le \exp\left(-\frac{2\big((p+\delta)M - pM\big)^2}{M}\right) = e^{-2\delta^2 M},
\]
and therefore the probability that no $k > K$ satisfies (C.1) is at least $1 - e^{-2\delta^2 M}$, as desired.
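The Hoeffding step in Proof 25 can be checked by simulation, with illustrative parameter values:

```python
import math
import random

random.seed(1)
M, p, delta = 200, 0.1, 0.1
c = (p + delta) * M  # the threshold (p + delta)M from the proof
trials = 5000

# Empirical frequency of the event {S >= c} for S ~ Bin(M, p).
exceed = sum(
    sum(random.random() < p for _ in range(M)) >= c for _ in range(trials)
) / trials

# Hoeffding: Pr(S - pM >= delta*M) <= exp(-2 delta^2 M)
bound = math.exp(-2 * delta**2 * M)
print(exceed, "<=", bound)
```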

Lemma 14 The probability that no $k < K$ satisfies (C.1) is at least
\[
\left(1 - e^{-\frac{\beta^2}{2} M}\right)\left[1 - \exp\left(-\frac{(1-p)^K \beta^2 M}{2(2p + \delta + \beta)}\right)\right].
\]

Proof 26 If there exists $k < K$ satisfying (C.1), then $H$ must include strictly fewer than $c + S$ non-overlapping complete $K$-partite subgraphs. Otherwise, $\Delta_{k,H} \ge (c + S)(K - k) \ge \Delta_{K,H} + c(K - k)$ for every $k < K$, since every complete $K$-partite graph is $K$-colorable and every subgraph of a complete graph is itself a complete graph.

Define $H_P$ to be the random subgraph of $H$ induced by the vertices in $V \setminus V_E$ (the perfectly maximized observations). Then, if $H_P$ contains at least $c + S$ non-overlapping complete $K$-partite subgraphs, $H$ must also. I determine the probability that $H_P$ includes at least $c + S$ non-overlapping complete $K$-partite subgraphs as a lower bound.

I first show that $S < \left(p + \frac{\beta}{2}\right)M$ with probability at least $1 - e^{-\frac{\beta^2}{2} M}$, and subsequently that, conditional on the event $\left\{S < \left(p + \frac{\beta}{2}\right)M\right\}$, subgraph $H_P$ includes at least $c + S$ non-overlapping complete $K$-partite subgraphs with probability $1 - \exp\left(-\frac{(1-p)^K \beta^2 M}{2(2p + \delta + \beta)}\right)$. The first statement follows immediately from Hoeffding's inequality, since $E(S) = pM$


and
\[
\Pr\left(S < \left(p + \frac{\beta}{2}\right)M\right) \ge 1 - e^{-\frac{\beta^2}{2} M}. \tag{C.2}
\]

Suppose $S < \left(p + \frac{\beta}{2}\right)M$. Index the $d$ complete $K$-partite subgraphs in $G$ (existence from Claim 13) by $i = 1, 2, \dots, d$. Let $X_i$ be the indicator variable which takes value 1 if every vertex in complete $K$-partite subgraph $i$ is perfectly maximized, and let $X = \sum_{i=1}^d X_i$. Notice $X_i \sim \mathrm{Ber}\big((1-p)^K\big)$ for every $i$ and $EX = d(1-p)^K$. Using Hoeffding's inequality,
\[
\Pr(X < c + S) = \Pr\left(X - d(1-p)^K < c + S - d(1-p)^K\right) \le \exp\left(-\frac{2\left(c + S - d(1-p)^K\right)^2}{d}\right).
\]
Since by assumption $\delta \in \left(0, \frac{d(1-p)^K}{M} - 2p - \beta\right)$, it follows that $d \ge \frac{(2p + \beta + \delta)M}{(1-p)^K}$. Therefore, $d(1-p)^K > (2p + \delta + \beta)M > c + S$, so
\[
\frac{\partial f(S)}{\partial d} = -2(1-p)^{2K} + \frac{2(c+S)^2}{d^2} < 0, \qquad \frac{\partial f(S)}{\partial S} = 4(1-p)^K - \frac{4(c+S)}{d} > 0,
\]
where $f(S) = -\frac{2\left(c + S - d(1-p)^K\right)^2}{d}$. Then, using the upper bound on $S$ and the lower bound on $d$,
\[
\exp\left(-\frac{2\left(c + S - d(1-p)^K\right)^2}{d}\right) \le \exp\left(-\frac{(1-p)^K \beta^2 M}{2(2p + \delta + \beta)}\right). \tag{C.3}
\]

From (C.2) and (C.3), the probability that no $k < K$ satisfies (C.1) is at least
\[
\left(1 - e^{-\frac{\beta^2}{2} M}\right)\left[1 - \exp\left(-\frac{(1-p)^K \beta^2 M}{2(2p + \delta + \beta)}\right)\right],
\]
as desired.


Using Lemmas 13 and 14, the probability that no $k \in \mathbb{Z}_+$ satisfies (C.1) is at least
\[
\left(1 - e^{-\frac{\beta^2}{2} M}\right)\left[1 - \exp\left(-\frac{(1-p)^K \beta^2 M}{2(2p + \delta + \beta)}\right)\right] - \exp\left(-2\delta^2 M\right) = 1 - O(c^{-M}),
\]
where $c = \min\left\{ \exp\left(\frac{\beta^2}{2}\right),\; \exp\left(\frac{(1-p)^K \beta^2}{2(2p + \delta + \beta)}\right),\; \exp\left(2\delta^2\right) \right\}$.

C.2 Corollary 1

Take $\beta = 0.1$ and $\delta = 0.05$. Then for any $d > \frac{(2p + \beta + \delta)M}{(1-p)^K}$,
\[
\frac{d(1-p)^K}{M} - 2p - \beta > \delta = 0.05,
\]
so that $\delta = 0.05 \in \left(0, \frac{d(1-p)^K}{M} - 2p - \beta\right)$. Directly apply Theorem 1.

C.3 Corollary 2

By assumption, $P$ includes at least $d$ non-overlapping sets of choice problems in $K$-violation of IIA. Enumerate the $Kd$ choice problems included in a $K$-violation using $i = 1, \dots, Kd$. Let $Z_i$ be the random variable whose outcome is the number of times choice problem $i$ is sampled, and let $Q_i$ be the event $\left\{Z_i < \alpha \frac{M}{2^N K}\right\}$, where
\[
\alpha = \frac{2p + \frac{\beta}{2} + \delta}{2p + \beta + \delta} < 1.
\]
Then,
\[
\Pr\left(\bigcup_{i=1}^{Kd} Q_i\right) \le \sum_{i=1}^{Kd} \Pr(Q_i) = Kd \cdot \Pr\left(Z_1 < \alpha \frac{M}{2^N K}\right).
\]


Since $Z_1 \sim \mathrm{Bin}\left(M, \frac{1}{2^N K}\right)$ and $E(Z_1) = \frac{M}{2^N K}$, it follows from Hoeffding's Inequality that
\[
Kd \cdot \Pr\left(Z_1 - \frac{M}{2^N K} < \alpha \frac{M}{2^N K} - \frac{M}{2^N K}\right) \le Kd \exp\left(-\frac{2\left[(1-\alpha)\frac{M}{2^N}\right]^2}{M}\right) = Kd \exp\left(-\frac{(1-\alpha)^2 M}{2^{2N-1}}\right) := g(K, d, p, \beta, \delta, M). \tag{C.4}
\]
Therefore $\Pr\left(Z_i \ge \alpha \frac{M}{2^N K} \text{ for every } i\right) = 1 - \Pr\left(\bigcup_{i=1}^{Kd} Q_i\right) \ge 1 - Kd \exp\left[-\frac{(1-\alpha)^2 M}{2^{2N-1}}\right]$.

Conditional on the event $\left\{Z_i \ge \alpha \frac{M}{2^N K} \text{ for every } i\right\}$, there are at least $d\alpha \frac{M}{2^N K} > \frac{\left(2p + \frac{\beta}{2} + \delta\right)M}{(1-p)^K}$ non-overlapping sets of $K$ choice problems in $K$-violation of IIA in the sampled data, and we can apply Theorem 1 to conclude that the probability of recovery has lower bound
\[
f(K, d, p, \beta, \delta, M) := \left(1 - e^{-\frac{\beta^2}{8} M}\right)\left[1 - \exp\left(-\frac{(1-p)^K \beta^2 M}{8\left(2p + \delta + \frac{\beta}{2}\right)}\right)\right] - \exp\left(-2\delta^2 M\right). \tag{C.5}
\]

From (C.4) and (C.5), the probability of recovery is at least
\[
\big(1 - g(K, d, p, \beta, \delta, M)\big)\, f(K, d, p, \beta, \delta, M) = 1 - O(c^{-M})
\]
with
\[
c := c(K, d, p, \beta, \delta) = \min\left\{ \exp\left(\frac{\beta^2}{8}\right),\; \exp\left(\frac{(1-p)^K \beta^2}{8\left(2p + \delta + \frac{\beta}{2}\right)}\right),\; \exp\left(2\delta^2\right),\; \exp\left[\frac{1}{2^{2N-1}}\left(1 - \frac{2p + \frac{\beta}{2} + \delta}{2p + \beta + \delta}\right)^2\right] \right\} > 1,
\]
as desired.


C.4 Proof of Proposition 1

For any set of orderings $R$ and ordering $r \in R$, define $g(r, R)$ to be the set of all choice observations consistent with maximization of $r$ and inconsistent with maximization of any other $r' \in R$.6 The set of revealed preferences in $g(r, R)$ is given by the binary relation
\[
B_r := \{(x, y) : (x, A) \in g(r, R) \text{ for some } A \text{ including } y\}.
\]
Let $\overline{B}_r$ be its transitive closure. The following is a necessary condition for identifiability of $R$.

Claim 14 Suppose there exist orderings $r, r' \in R$ such that
\[
\mathrm{argmax}_i\, r_i \neq \mathrm{argmin}_i\, r'_i. \tag{C.6}
\]
Then, $R$ is not identified.

Proof 27 First, I show that $R = \{r^1, \dots, r^K\}$ is identified only if $\overline{B}_r$ is complete for every $r \in R$. Suppose to the contrary that $R$ is identified, but (without loss of generality) $\overline{B}_{r^1}$ is not complete. Then there exists some ordering $\tilde{r}^1 \neq r^1$ such that every choice observation in $g(r^1, R)$ is consistent with maximization of $\tilde{r}^1$, so we can replace $r^1$ with $\tilde{r}^1$ in $R$ without loss of any choice implications. Formally, define $R' = \{\tilde{r}^1, r^2, \dots, r^K\}$. Since $C(R) \subseteq C(R')$, it follows from Observation 1 that $R$ is not identified.

Next, I show that if there exist orderings $r, r' \in R$ satisfying (C.6), then $\overline{B}_r$ is not complete for some $r \in R$. Index the alternatives such that $x_1 := \mathrm{argmax}_i\, r_i$ and $x_2 := \mathrm{argmin}_i\, r'_i$. I show that neither $(x_1, x_2)$ nor $(x_2, x_1)$ is in $\overline{B}_{r'}$, and hence $\overline{B}_{r'}$ is not complete. Suppose towards contradiction that $(x_1, x_2) \in \overline{B}_{r'}$. Then $(x_1, A) \in g(r', R)$

6For example, if $R = \{(1, 2, 3), (2, 3, 1)\}$, then $g((1, 2, 3), R) = \{(x_3, \{x_1, x_2, x_3\}), (x_3, \{x_1, x_3\}), (x_3, \{x_2, x_3\})\}$, since these observations are consistent with maximization of $(1, 2, 3)$ and inconsistent with maximization of $(2, 3, 1)$.


for some $A \in \mathcal{A}$. But since $x_1$ is $r$-maximal, every observation in which $x_1$ is selected is consistent with maximization of ordering $r$. Thus, for every $A \in \mathcal{A}$, choice observation $(x_1, A) \notin g(r', R)$. This yields the desired contradiction. Suppose alternatively that $(x_2, x_1) \in \overline{B}_{r'}$. Then, $(x_2, A) \in g(r', R)$ for some $A \in \mathcal{A}$. But $x_2$ is ranked last according to $r'$, so $(x_2, A) \notin g(r', R)$ for every $A \in \mathcal{A}$. This yields the desired contradiction. Therefore, if there exist orderings $r, r' \in R$ satisfying (C.6), then $R$ is not identified.
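The construction of $g(r, R)$ in footnote 6 can be reproduced by enumeration. In the sketch below, an ordering assigns utility $r[i-1]$ to alternative $x_i$ (so that argmax picks the best alternative, matching the footnote's convention), and menus range over all subsets of size at least two, since singleton menus are consistent with every ordering and never appear in $g$:

```python
from itertools import combinations

X = [1, 2, 3]  # alternative x_i carries utility r[i-1] under ordering r

def chosen(r, A):
    """The r-maximal element of menu A."""
    return max(A, key=lambda x: r[x - 1])

def g(r, R):
    """Observations consistent with maximizing r and inconsistent with
    maximization of every other ordering in R."""
    out = []
    for size in (2, 3):
        for A in combinations(X, size):
            x = chosen(r, A)
            if all(chosen(rp, A) != x for rp in R if rp != r):
                out.append((x, set(A)))
    return out

R = [(1, 2, 3), (2, 3, 1)]
print(g((1, 2, 3), R))
# → [(3, {1, 3}), (3, {2, 3}), (3, {1, 2, 3})], the three observations in footnote 6
```

Note that the menu $\{x_1, x_2\}$ is correctly excluded: both orderings select $x_2$ from it, so the observation is consistent with maximization of both.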

It follows immediately from Claim 14 that every set $R = \{r^1, r^2\}$ with $\mathrm{argmax}_i\, r^j_i \neq \mathrm{argmin}_i\, r^{3-j}_i$ for some $j \in \{1, 2\}$ is not identified. Moreover, since every set $R$ with $|R| \ge 3$ must include orderings satisfying (C.6), it follows from Claim 14 that every set $R$ with three or more orderings is not identified.

Next, I show that sets $R = \{r^1, r^2\}$ with $\mathrm{argmax}_i\, r^j_i = \mathrm{argmin}_i\, r^{3-j}_i$ for $j = 1, 2$ are identified. Index the alternatives such that $x_1 = \mathrm{argmax}_i\, r^1_i$ and $x_N = \mathrm{argmax}_i\, r^2_i$, and define $D = C(r^1) \cup C(r^2)$. Suppose to the contrary that there exists a set of orderings $R' = \{\tilde{r}^1, \tilde{r}^2\} \neq R$ such that $E(D, R') = 0$. I show a contradiction by identifying an observation in $D$ which is inconsistent with maximization of both $\tilde{r}^1$ and $\tilde{r}^2$.

First observe that necessarily either $x_1$ is highest ranked in $\tilde{r}^1$ and $x_N$ is highest ranked in $\tilde{r}^2$, or vice versa, since $(x_1, X), (x_N, X) \in D$. Without loss of generality, suppose the former. Since $\tilde{r}^1 \neq r^1$, there exist alternatives $x_k, x_l$ with $k, l \notin \{1, N\}$ such that $\tilde{r}^1_k < \tilde{r}^1_l$ and $r^1_k > r^1_l$; that is, $x_k$ is higher ranked than $x_l$ under $r^1$ but not under $\tilde{r}^1$. Let $A$ be the set of all alternatives ranked lower than $x_k$ in ordering $r^1$, noting that $x_N \in A$ since $x_N = \mathrm{argmin}_i\, r^1_i$ by assumption. Then, choice observation $(x_k, A) \in C(r^1)$, but it is inconsistent with maximization of $\tilde{r}^2$ since $x_N \in A$. Moreover, $(x_k, A)$ is inconsistent with maximization of $\tilde{r}^1$ since $x_l \in A$. This yields the desired contradiction.

Finally, every singleton set $R = \{r\}$ is trivially identified using the set of all of its choice implications $C(r)$.