
Towards Cooperative Learning Equilibrium in Reinforcement Learning (WORKING DRAFT)

Jesse Clifton∗

March 6, 2020

Abstract

I consider the problem of constructing learning equilibria in the multi-agent reinforcement learning setting. A learning equilibrium is a profile of learning algorithms which constitutes a Nash equilibrium; that is, no player can benefit, in terms of the long-term rewards generated by the profile of learning algorithms, by deviating from that profile. I argue that learning equilibrium is a more appropriate solution concept for certain applications of multi-agent reinforcement learning than others which are frequently studied. I then propose an approach to constructing learning equilibria in which reinforcement learners jointly maximize a welfare function which represents some compromise between their individual payoffs. Deviations from the optimization of this welfare function are punished. I sketch a formal framework and present some proof-of-concept experiments.

* [email protected]

1 Introduction

As the capabilities of artificial agents improve, ensuring that these agents interact cooperatively is an increasingly important research direction. In the setting I envision, multiple reinforcement learners are deployed in an initially unknown environment to act on behalf of their respective principals. These principals will generally have differing goals, and so need to design their learners to avoid conflict while still advancing their own goals as well as possible.

A natural criterion for rational behavior on the part of the principals is whether their learners are in learning equilibrium (LE) (Brafman and Tennenholtz, 2003): do the learning algorithms constitute a Nash equilibrium, in terms of the expected cumulative rewards they generate? Beyond that, the learning equilibrium should be in some sense socially optimal. It should be Pareto-optimal, and should moreover reflect some combination of the total utility gained by all of the agents, the fairness of the distribution of utility, and other normative considerations.

The purpose of this work is to make progress towards these desiderata by sketching a new framework for multi-agent reinforcement learning. To measure the quality of different payoff profiles, I rely on a welfare function. The learning equilibrium I construct will converge to a policy profile that maximizes this welfare function. It is beyond the scope of this article to argue for a particular welfare function, as this is ultimately an ethical judgement to be made by the human principals. For the sake of the discussion here, I refer to several widely-used welfare functions from the cooperative bargaining literature, which have some appealing theoretical properties (Table 1). There are several benefits to this approach:

1. In this setting, LE is a better account of individual rationality than other solution concepts used in multi-agent learning. The principals will want to choose their learners strategically. This means that it is not enough for a learner to perform well against a set of non-strategic opponents (e.g. Shoham and Leyton-Brown 2008). It is also not enough that a learner converges to a Nash equilibrium of the game which is being learned – which I call the “base game” – in self-play (e.g. Bowling and Veloso 2001; Conitzer and Sandholm 2007; Balduzzi et al. 2018; Letcher et al. 2018), as this does not guarantee that a player cannot benefit by deviating from that profile (i.e., submitting a different learning algorithm).

2. Repeated and stochastic games exhibit many Nash equilibria. This is one of the lessons of the folk theorems (e.g. Fudenberg and Maskin 1986), which say that any payoff profile in which each agent gets at least what they would get if they were being maximally punished by their counterparts is attained in some equilibrium. Narrowing the space of learning equilibria to those in which a reasonable welfare function is optimized greatly improves the situation. (While I do not pursue the possibility here, there is some hope that the equilibrium selection problem in this setting could be solved entirely by pairs of learning algorithms which learn to agree on a welfare function even if they start out optimizing different welfare functions; cf. Van Damme (1986)’s study of higher-order bargaining over bargaining solutions.)

Moreover, my construction draws attention to the advantages of coordination by the principals on the learning algorithms used by the agents they deploy. The equilibrium selection problem can be solved if a welfare function is agreed upon beforehand, even if the environment in which the reinforcement learners will be deployed is initially unknown. This is an improvement on results like that of Letcher et al. (2018), who guarantee that their opponent-aware algorithm converges to something like a local Nash equilibrium in self-play, but provide no guarantees on the quality (with respect to Pareto optimality or social welfare) of that solution. If principals are already in a setting where they can coordinate on the learning algorithms they deploy, they can do better by taking the LE approach and ensuring that these learning algorithms converge to a reasonable equilibrium.

3. Finding Nash equilibria is difficult in complex reinforcement learning environments (Lanctot et al., 2017). My construction splits the computation of a welfare-optimal learning equilibrium into relatively simple problems: i) the optimization of a welfare function (essentially a single-agent reinforcement learning problem); ii) the detection of defection (whose difficulty varies depending on how transparent the agents’ decision-making is to their counterparts); and iii) punishment (again, a relatively easy problem). (Compare with De Cote and Littman (2012)’s efficient algorithm for computing equilibria in stochastic games in the case where a model of the game is available, which uses the folk theorem and the egalitarian welfare function.)

Learning equilibrium was first discussed (under that name) by Brafman and Tennenholtz (2003). Their setting differs in several ways from the present one, however. For stochastic games, their approach involves first finding a cooperative policy profile offline, and then deploying learning algorithms which enforce the usage of the cooperative policies. As I am interested in the setting where the environment is initially unknown (and therefore an initial offline planning step is not available), I instead focus on the setting of online reinforcement learning in which learning consists in incremental stochastic updates towards the optimizer of a value function. Second, unlike Brafman and Tennenholtz I am motivated by a normative concern with constructing learning equilibria that are optimal with respect to an appropriate welfare function.

Several authors have recently studied cooperation among deep reinforcement learners.¹ Peysakhovich and Lerer (2017), Lerer and Peysakhovich (2017), and Wang et al. (2018) all study the setting in which agents are able to plan offline, and train punishment policies to deter defection. For these authors, cooperation corresponds to policies which together maximize the sum of the players’ payoffs; in the language of this paper, this means that they are enforcing the optimization of a particular (utilitarian) welfare function. In the online learning setting, Zhang and Lesser (2010), Foerster et al. (2018), and Letcher et al. (2018) have studied “opponent-aware learning”, in which players can see the parameters of their counterparts’ policies. The learning algorithms of Foerster et al. and Letcher et al. are “opponent-shaping” in that they choose parameter updates which are intended to influence their counterparts’ parameter updates in a way that improves their own payoffs. Foerster et al. find that their learning algorithm, learning with opponent-learning awareness (LOLA), leads to higher rates of cooperation in iterated Prisoner’s Dilemmas than naive learners. And, Letcher et al. show that a certain combination of LOLA and Zhang and Lesser (2010)’s “lookahead” learning algorithm guarantees convergence to a stable fixed point — a point at which each player’s value function is smooth and is locally optimized with respect to their own parameters — when such a point exists. However, as discussed in points 1 and 2 above, this is insufficient to guarantee either that the learning algorithms themselves constitute a learning equilibrium, or that the stable fixed point which is converged to constitutes a reasonable pair of policies (in the sense of scoring high on some measure of total welfare). I also find below that the cooperativeness of these algorithms is sensitive to their learning rate.

I focus on the case in which policy parameters are mutually visible (i.e., “perfect policy monitoring”). I also present initial experiments for the case in which only actions are mutually visible (and therefore players must infer whether their counterpart is following the cooperative optimization schedule). I show how to construct learning equilibria by punishing deviations from the optimization of the welfare function in the perfect policy monitoring case. These equilibria involve finite-time punishments. The duration of punishment phases depends on the amount of data it takes to accurately estimate a punishment policy and on the amount of time it takes for the finite-horizon value of the learned punishment policy to approach its infinite-horizon average value.

¹ See also the survey of cooperation in deep reinforcement learning in (Hernandez-Leal, Kartal, and Taylor, 2019, Section 3.5).


2 Learning Equilibrium in Stochastic Games

Consider a situation in which multiple agents interact with their environment over time, incrementally updating the policies which they use to choose actions. These updates are controlled by their learning algorithms, which are fixed by their principals before the agents begin interacting with the environment. It is helpful to distinguish between three different games that are being played here:

• The base game, which is defined by the dynamics of the environment, the reward functions of the agents, and the policies available to each agent. In this paper, the “base game” will be a stochastic game (described below). We can ask whether a particular profile of policies is a Nash equilibrium in the base game. However, as discussed in point 1 in the introduction, this would not guarantee that the learning algorithms which converge to that profile of policies are in equilibrium.

• The learning game, in which the principals simultaneously submit learning algorithms to be deployed in the base game environment, and attain payoffs equal to the limit of the average rewards they generate.

• The update games at each time $t$, in which each player simultaneously decides which policy they will follow at time $t+1$ based on their history of observations. In my construction of learning equilibria, I will define cooperation in the update game as (roughly) taking a stochastic policy update step towards the optimum of the welfare function. Acting otherwise will be called defection.

I work with stochastic games (Littman, 1994), a generalization of Markov decision processes to the multi-agent setting. I will assume only two players, $i = 1, 2$. For notational simplicity, I will sometimes make statements about players $i$ and their counterpart $\{1, 2\} \setminus \{i\}$ using the indices 1 and 2; all such statements continue to hold when swapping indices 1 and 2. In a stochastic game, players simultaneously take actions at each time $t \in \mathbb{N}$; observe a reward depending on the current state; and the state evolves according to a Markovian transition function. Formally, a 2-player stochastic game is a tuple $(S, A_1, A_2, r_1, r_2, P)$, where

• $S$ is the state space, with the state at time $t$ denoted $S^t$;

• $A_i$ is the space of actions (assumed to be finite) available to player $i$, with the action taken by player $i$ at time $t$ denoted $A_i^t$;


• $r_i$ is the reward function of player $i$, with the reward gained by player 1 at time $t$ given by $R_1^t = r_1(S^t, A_1^t, A_2^t)$;

• $P$ is a Markovian transition function, such that the conditional distribution over states at time $t+1$ is given by $P(S^{t+1} \in B \mid \{S^v, A_1^v, A_2^v\}_{v=0}^t) = P(S^{t+1} \in B \mid S^t, A_1^t, A_2^t)$ for measurable $B \subseteq S$.

Define a (stationary) policy $\pi_i$ to be a mapping from states to random variables taking values in $A_i$, such that $P\{\pi_i(s) = a\}$ is the probability that player $i$ takes action $a$ in state $s$ when following policy $\pi_i$. Players wish to learn a policy which (in some sense) leads to large cumulative reward. In the online setting, each player is learning over a space of policies $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta\}$ for some parameter space $\Theta$. Correspondingly, define for parameter profiles $\theta = (\theta_1, \theta_2)$ (and arbitrary initial state $s_0$) the policy-value functions

$$V_1(\theta) = \lim_{T \to \infty} T^{-1}\, \mathbb{E}_{\pi_{\theta_1}, \pi_{\theta_2}}\left( \sum_{t=1}^{T} R_1^t \;\middle|\; S^0 = s_0 \right),$$
where $\mathbb{E}_{\pi_{\theta_1}, \pi_{\theta_2}}$ denotes the expectation with respect to trajectories in which players follow policies $\pi_{\theta_1}, \pi_{\theta_2}$ at each time step. That is, $V_1(\theta)$ is player 1’s expected average return when player 1 follows $\pi_{\theta_1}$ and player 2 follows $\pi_{\theta_2}$. This limit is guaranteed to exist by the ergodicity assumption which is made in the construction of learning equilibria (Theorem 3.1).

Let $H^t$ be the history of observations until time $t$. I will assume that both agents fully observe the state and each other’s actions and rewards, such that $\{(S^v, A_1^v, A_2^v, R_1^v, R_2^v)\}_{v=0}^{t} \in H^t$. I distinguish between the cases in which the players are able to see the policy generating their counterpart’s actions, and those in which they can only see the actions.

Definition 2.1 (Policy monitoring). The set of histories $\mathcal{H}$ satisfies perfect policy monitoring if for each $H = \{H^0, H^1, \ldots\} \in \mathcal{H}$, each $t$, and $i = 1, 2$,
$$\{(\theta_1^v, \theta_2^v)\}_{v=1}^{t} \in H^t, \quad \text{where } A_i^v \sim \pi_{\theta_i^v}(S^v). \qquad (1)$$
$\mathcal{H}$ exhibits imperfect policy monitoring if (1) is not satisfied for some $H^t$.

Definition 2.2 (Learning algorithm). A learning algorithm for player $i$ is a mapping $\sigma_i$ from histories $H^t$ to parameters, i.e., $\sigma_i(H^t) \in \Theta$.


Define the value to player 1 of learning algorithm profile $(\sigma_1, \sigma_2)$ starting from history $H^t$ as
$$V_1(H^t, \sigma_1, \sigma_2) = \liminf_{T \to \infty} T^{-1}\, \mathbb{E}_{\sigma_1, \sigma_2}\left( \sum_{v=t}^{t+T} R_1^v \;\middle|\; H^t \right),$$
where $\mathbb{E}_{\sigma_1, \sigma_2}$ is the expectation taken with respect to trajectories in which agents follow learning algorithms $\sigma_1, \sigma_2$ at each step. Let the initial history be $H^0 = (S^0, \theta_1^0, \theta_2^0)$. Then, the game described above may be written in normal form as a game with strategy spaces $\Sigma_1, \Sigma_2$ corresponding to spaces of possible learning algorithms, and player 1’s utilities at each profile $(\sigma_1, \sigma_2)$ given by $V_1(H^0, \sigma_1, \sigma_2)$. This is a learning game. A learning equilibrium is a Nash equilibrium of a learning game:

Definition 2.3 (Learning equilibrium). Learning algorithm profile $(\sigma_1, \sigma_2)$ is a learning equilibrium of the learning game with learning algorithm spaces $\Sigma_1, \Sigma_2$ if
$$\sup_{\sigma \in \Sigma_1 \setminus \{\sigma_1\}} V_1(H^0, \sigma, \sigma_2) \le V_1(H^0, \sigma_1, \sigma_2) \quad \text{and} \quad \sup_{\sigma \in \Sigma_2 \setminus \{\sigma_2\}} V_2(H^0, \sigma_1, \sigma) \le V_2(H^0, \sigma_1, \sigma_2). \qquad (2)$$
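To make the learning-game objects above concrete, here is a minimal Python sketch of evaluating a profile of learning algorithms by its empirical average reward. The environment interface (`reset`, `sample_action`, `step`) and the finite horizon `T` are illustrative assumptions of the sketch, not part of the formal definition.

```python
import numpy as np

def run_learning_game(env, sigma_1, sigma_2, theta_1, theta_2, T=10_000):
    """Roll out a profile of learning algorithms (sigma_1, sigma_2) in a
    two-player stochastic game `env` and return each player's empirical average
    reward, a finite-horizon proxy for V_i(H^0, sigma_1, sigma_2)."""
    history = []                  # H^t: list of (state, actions, rewards, params)
    state = env.reset()
    totals = np.zeros(2)
    for t in range(T):
        # Each learning algorithm maps the observed history to new policy parameters.
        theta_1 = sigma_1(history, theta_1)
        theta_2 = sigma_2(history, theta_2)
        a1 = env.sample_action(theta_1, state, player=0)
        a2 = env.sample_action(theta_2, state, player=1)
        state_next, (r1, r2) = env.step(state, a1, a2)
        totals += (r1, r2)
        history.append((state, (a1, a2), (r1, r2), (theta_1, theta_2)))
        state = state_next
    return totals / T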

3 Bargaining and Enforcement

The equilibrium learning algorithm profiles I construct will converge to the optimizer of a welfare function. This function is intended to encode a compromise between the principals’ individual payoffs. One intuitive welfare function is the utilitarian welfare (sometimes just called the “social welfare”). For player utility functions $u_i$ and outcomes $x$, this function calculates welfare according to $w^{\text{Util}}(x) = u_1(x) + u_2(x)$. Another welfare function is the Nash welfare (Nash, 1950). This welfare function also depends on “disagreement” utilities $u_i^d$, which represent the value a player might get if the parties fail to compromise. (For instance, Felsenthal and Diskin (1982) argue that the disagreement point for a player should be the minimum utility they receive in any Pareto-optimal outcome.) Nash showed that the only way of deciding on an outcome $x$ which satisfies some natural axioms is $\arg\max_x w^{\text{Nash}}\{u_1(x), u_2(x)\}$, where $w^{\text{Nash}}\{u_1(x), u_2(x)\} = \{u_1(x) - u_1^d\} \times \{u_2(x) - u_2^d\}$. Table 1 contains a list of welfare functions corresponding to widely-discussed bargaining solutions and a discussion of some of their properties.


Name of welfare function $w$ | Form of $w\{V_1(\theta), V_2(\theta)\}$

Nash (Nash, 1950) | $\left[V_1(\theta) - V_1^d\right] \cdot \left[V_2(\theta) - V_2^d\right]$

Kalai–Smorodinsky (Kalai and Smorodinsky, 1975) | $V_1(\theta)^2 + V_2(\theta)^2 - \iota\left\{ \dfrac{V_1(\theta) - V_1^d}{V_2(\theta) - V_2^d} = \dfrac{\sup_{\theta_1, \theta_2} V_1(\theta_1, \theta_2) - V_1^d}{\sup_{\theta_1, \theta_2} V_2(\theta_1, \theta_2) - V_2^d} \right\}$

Egalitarian (Kalai, 1977) | $\min\left\{ V_1(\theta) - V_1^d,\; V_2(\theta) - V_2^d \right\}$

Utilitarian (e.g. Harsanyi 1955) | $V_1(\theta) + V_2(\theta)$

Table 1: Welfare functions corresponding to several widely-discussed bargaining solutions, adapted to the multi-agent RL setting where two agents with value functions $V_1, V_2$ are bargaining over the pair of policies with parameters $\theta = (\theta_1, \theta_2)$ to deploy. The function $\iota$ in the definition of the Kalai–Smorodinsky welfare is the $\infty$-$0$ indicator, used to enforce the constraint in its argument.

Note that in bargaining problems in which the set of feasible payoffs is convex, the Nash welfare function uniquely satisfies the properties of (1) Pareto optimality, (2) symmetry, (3) invariance to affine transformations, and (4) independence of irrelevant alternatives. See Zhou (1997) for a characterization of the Nash welfare in non-convex bargaining problems. The Nash welfare can also be obtained as the subgame perfect equilibrium of an alternating-offers game as the “patience” of the players goes to infinity (Binmore, Rubinstein, and Wolinsky, 1986).

On the other hand, the Kalai–Smorodinsky welfare uniquely satisfies (1)–(3) plus (5) resource monotonicity, which means that all players are weakly better off when there are more resources to go around. The egalitarian solution instead satisfies (1), (2), (4), and (5). The utilitarian welfare function is implicitly used in Brafman and Tennenholtz (2003)’s initial work on learning equilibrium and the work of Peysakhovich and Lerer (2017); Lerer and Peysakhovich (2017); Wang et al. (2018) on cooperation in sequential social dilemmas.
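For concreteness, here is a small Python sketch of three of the welfare functions in Table 1 applied to a candidate payoff pair; clipping the Nash gains at zero below the disagreement point is a convention of this sketch, not something specified in the text.

```python
def utilitarian_welfare(v1, v2):
    # w^{Util}(V_1, V_2) = V_1 + V_2
    return v1 + v2

def nash_welfare(v1, v2, d1, d2):
    # w^{Nash}(V_1, V_2) = (V_1 - V_1^d)(V_2 - V_2^d); gains are clipped at zero
    # below the disagreement point (a convention of this sketch).
    return max(v1 - d1, 0.0) * max(v2 - d2, 0.0)

def egalitarian_welfare(v1, v2, d1, d2):
    # w^{Egal}(V_1, V_2) = min(V_1 - V_1^d, V_2 - V_2^d)
    return min(v1 - d1, v2 - d2)
```

The Kalai–Smorodinsky welfare is omitted here because it is a constrained objective rather than a simple function of the payoff pair.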


The welfare function $w$ will act on policy profiles $\theta$, i.e., write $w(\theta) = w\{V_1(\theta), V_2(\theta)\}$. Write the set of (locally) welfare-optimal policy profiles as $\Theta^{2,C}$. As discussed below, we expect to be able to construct algorithms which converge to a policy profile in $\Theta^{2,C}$ using standard reinforcement learning methods. However, in order to construct a learning equilibrium, we also need to be able to disincentivize defections from an algorithm which optimizes $w$. For this reason, I introduce a punishment point $V_i^p$ for player $i$. To construct an equilibrium, we will need to assume that a player’s punishment payoffs are worse than their bargaining payoffs. That is, there exists a punishment policy parameter $\theta_1^p$ such that
$$\sup_{\theta_2} V_2(\theta_1^p, \theta_2) = V_2^p < \inf_{\theta \in \Theta^{2,C}} V_2(\theta).$$

3.1 Learning and enforcement

We focus on the case of perfect policy monitoring (Definition 2.1).

Inadequacy of base-game solution concepts

Perfect policy monitoring is the setting of learning with opponent-learning awareness (LOLA) (Foerster et al., 2018):
$$\sigma_1^{\text{LOLA}}(H^t) = \theta_1^t + \nabla_{\theta_1} V_1(\theta^t)\,\delta_t + \left(\nabla_{\theta_2} V_1(\theta^t)\right)^{\!\top} \nabla_{\theta_1} \nabla_{\theta_2} V_2(\theta^t)\,\delta_t \xi_t, \qquad (3)$$
for step sizes $\delta_t, \xi_t$. LOLA acts as if it is the Stackelberg leader (e.g. Colman, Pulford, and Lawrence 2014) of the update game. That is, it acts as if it can commit to an update at time $t$ and that the counterpart’s update at time $t$ is a (local) best response to this move. (One-step LOLA is the rule in display 3, while higher-order LOLA accounts for player 2’s modeling of 1’s updates.) While Foerster et al. (2018) do not find evidence that first-order LOLA is exploitable by higher-order LOLA in their preliminary experiments, it is an open question how exploitable LOLA (and its successor, stable opponent shaping (SOS; Letcher et al. 2018)) is by higher-order opponent-shaping rules. Indeed, it is clear that non-equilibrium update rules like LOLA can be (at least locally) exploitable by learning algorithms of the form in display 4, which simply take the local best response to the counterpart’s next iterate:
$$\sigma_2^{\text{LOLA-exploiter}}(H^t) = \arg\max_{\|\theta_2 - \theta_2^t\| < \delta_t} V_2\left\{ \sigma_1^{\text{LOLA}}(H^t), \theta_2 \right\}. \qquad (4)$$
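As a rough illustration of the one-step LOLA update in display 3, here is a PyTorch sketch. It assumes that $V_1$ and $V_2$ are available as closed-form differentiable functions of the joint parameters (as in exact evaluations of small iterated games); that is an assumption of this sketch, not a requirement of the construction above.

```python
import torch

def lola_step_player1(theta1, theta2, V1, V2, delta=0.1, xi=0.1):
    """One-step LOLA update for player 1 (a sketch of display 3).
    theta1, theta2: 1-D parameter tensors; V1, V2: differentiable scalar
    functions of (theta1, theta2); delta, xi: step sizes."""
    theta1 = theta1.clone().requires_grad_(True)
    theta2 = theta2.clone().requires_grad_(True)
    v1, v2 = V1(theta1, theta2), V2(theta1, theta2)
    # Naive term: gradient of player 1's value w.r.t. its own parameters.
    grad1_v1 = torch.autograd.grad(v1, theta1, retain_graph=True)[0]
    # Gradient of player 1's value w.r.t. the counterpart's parameters.
    grad2_v1 = torch.autograd.grad(v1, theta2, retain_graph=True)[0]
    # Counterpart's own-gradient, kept in the graph for second differentiation.
    grad2_v2 = torch.autograd.grad(v2, theta2, create_graph=True)[0]
    # Shaping term: (grad_{theta2} V1)^T grad_{theta1} grad_{theta2} V2,
    # computed as a vector-Jacobian product.
    shaping = torch.autograd.grad(grad2_v2 @ grad2_v1, theta1)[0]
    return (theta1 + delta * grad1_v1 + delta * xi * shaping).detach()
```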


Moreover, if we want to guarantee welfare-optimality (and, a fortiori, Pareto-optimality), we cannot rely on learning algorithms which use base-game solution concepts. Recent base-game solution concepts in the perfect policy monitoring setting include:

1. Local Nash equilibrium (e.g. Balduzzi et al. 2018): A policy profile indexed by $\theta$ is a local Nash equilibrium if, for each $i$, there exists a neighborhood $\tilde\Theta_i$ of $\theta_i$ such that $V_i(\theta) \ge V_i(\tilde\theta_i, \theta_{-i})$ for all $\tilde\theta_i \in \tilde\Theta_i$.

2. Stable fixed point (Letcher et al., 2018): A policy profile indexed by $\theta$ is a stable fixed point if, for each $i$, $\nabla_{\theta_i} V_i(\theta) = 0$, and if the Hessian of the function $[V_1(\theta), V_2(\theta)]$ is positive semidefinite at $\theta$.

In each case it is easy to find examples of games and policy profiles satisfying the solution concept but which are Pareto sub-optimal. For instance, consider the iterated Prisoner’s Dilemma with memory-1 policies indexed by parameters $\theta_i^{aC} = P(A_i^t = C \mid A_{-i}^{t-1} = a)$ for $a \in \{\text{Cooperate }(C), \text{Defect }(D)\}$. Then the profile corresponding to constant defection, $(\theta_i^{CC}, \theta_i^{DC}) = (0, 0)$ for $i = 1, 2$, is a local Nash equilibrium but Pareto-suboptimal.

This point is not smooth, and so the stable fixed point concept does not apply. But examples of stable fixed points which are Pareto-suboptimal are also easy to find. For instance, consider the “cluttered coordination game” (Table 2). Consider policies parameterized as $P(A_i = j) \propto \exp(\theta_{ij})$ for $j \in \{A, B, C\}$. The only fully mixed Nash equilibrium, and therefore the only stable fixed point of this game, is given by policies with parameters $\theta_i = [\mathrm{logit}(0.4), \mathrm{logit}(0.4), \mathrm{logit}(0.2)]$, $i = 1, 2$. But this profile is Pareto-suboptimal.

                 Player 2
                 A        B        C
Player 1    A    1, 1     0, 0     0, 0
            B    0, 0     1, 1     0, 0
            C    0, 0     0, 0     2, 2

Table 2: Cluttered coordination game payoffs.
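A quick numerical check of the fully mixed equilibrium claimed above (a Python sketch; the payoff matrix is transcribed from Table 2):

```python
import numpy as np

# Payoff matrix of the cluttered coordination game (identical for both players),
# rows = own action, columns = opponent's action, in the order A, B, C.
payoffs = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 2.0]])

mix = np.array([0.4, 0.4, 0.2])   # candidate fully mixed equilibrium

# At a fully mixed Nash equilibrium every pure action earns the same expected
# payoff against the opponent's mixture.
print(payoffs @ mix)              # [0.4 0.4 0.4] -> all actions indifferent
print(mix @ payoffs @ mix)        # equilibrium payoff 0.4, versus 2.0 from coordinating on C
```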

This observation is also borne out empirically by the sensitivity of SOS (Letcher et al., 2018) to the algorithm’s learning rate. While Letcher et al. (2018) report that the SOS method empirically leads to convergence in the iterated Prisoner’s Dilemma with a learning rate $\alpha = 1.0$, changing the learning rate leads to mutual defection in some cases (Figure 1).² This should not be surprising, as SOS does not guarantee convergence to a Pareto-optimal policy profile.

Figure 1: Sensitivity of SOS to the learning rate $\alpha$ in the iterated Prisoner’s Dilemma. Replication of Letcher et al. (2018)’s experiments with different learning rates. Losses of 1.0 correspond to mutual cooperation, and losses of 2.0 correspond to mutual defection.

² The replications were done using the Github repository at https://github.com/aletcher/stable-opponent-shaping (accessed February 12, 2020). Thanks to Adrian Hutter for making me aware of the sensitivity of LOLA to the learning rate.

Construction of welfare-optimal learning equilibria

Let $\{\delta_t\}_{t=1}^{\infty}$ be a sequence of step sizes for the policy parameter updates. Let $\hat w$ be an estimator of the welfare function. Let $O^C$ be a base policy learning algorithm which maps histories $H^t$ to candidate updated parameters $(\theta_1^{t+1}, \theta_2^{t+1})$. It is simplest to think of $O^C$ as an incremental algorithm of the form
$$O^C(H^t) = \arg\max_{\|\Delta\theta_i\| < \delta_t,\; i = 1, 2} \hat w(\theta_1^t + \Delta\theta_1,\; \theta_2^t + \Delta\theta_2).$$
Policy gradient methods (Williams, 1992; Kakade, 2002) take this form or a similar one, for instance.
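A rough Python sketch of such an incremental, norm-constrained welfare-ascent step; the `welfare_grad` callable and the clipping rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cooperative_update(theta1, theta2, welfare_grad, delta_t):
    """One incremental step of a hypothetical O^C: move both players' parameters
    along the gradient of the estimated welfare, with each step's norm clipped
    at delta_t. `welfare_grad` returns the gradients of w-hat with respect to
    each player's parameters."""
    g1, g2 = welfare_grad(theta1, theta2)

    def clip(g):
        norm = np.linalg.norm(g)
        return g if norm <= delta_t else g * (delta_t / norm)

    return theta1 + clip(g1), theta2 + clip(g2)
```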

Write the updated parameters returned by $O^C$ as $\theta^{C,t+1} = O^C(H^t)$. We say that player $i$ defects at time $t$ if $\theta_i^t \ne \theta_i^{C,t}$, and construct a strategy which punishes defections. The learning equilibrium construction presented here divides time points into blocks, and deploys either a cooperative or punishment learning algorithm in each block depending on whether there was a defection in the previous block. (This block construction is similar to that used in Wiseman (2012)’s construction of equilibria for repeated games in which players only see noisy observations of the underlying stage game.) Specifically, in the $b$th block of length $M^b + N^b$:

1. If the other player cooperated at each time point in the previous block, then

• Deploy cooperative policy learner $O^C$ for $M^b$ steps to obtain a good estimate of the welfare-optimal policy $\hat\theta_i^{b,C}$;

• Deploy policy $\hat\theta_i^{b,C}$ for $N^b$ time steps to generate payoffs close to $V_i^C$.

2. If the other player defected at any time point in the previous block, then

• Deploy punishment policy learner $O^p$ for $M^b$ steps to obtain a good estimate of the punishment policy $\hat\theta_i^{b,p}$;

• Deploy policy $\hat\theta_i^{b,p}$ for $N^b$ time steps to generate payoffs close to $V_i^p$.

This algorithm is summarized in Algorithm 3.
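The block structure can be sketched in a few lines of Python; `coop_learner`, `punish_learner`, and `deploy` are hypothetical interfaces standing in for the components of Algorithm 3, not the paper's code.

```python
def learning_equilibrium_strategy(coop_learner, punish_learner, deploy, blocks):
    """Skeleton of the block structure summarized in Algorithm 3 (a sketch).
    `blocks` is a list of (M_b, N_b) sub-block lengths."""
    history = []
    counterpart_cooperated = True          # chi^{-1} = 1
    for M_b, N_b in blocks:
        learner = coop_learner if counterpart_cooperated else punish_learner
        # Learning sub-block: M_b incremental updates on the growing history.
        theta_b = learner(history, num_steps=M_b)
        # Deployment sub-block: play theta_b for N_b steps; `deploy` appends the
        # new transitions to `history` and reports whether the counterpart
        # deviated from the cooperative schedule at any point in this block.
        counterpart_cooperated = deploy(theta_b, history, num_steps=N_b)
```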

Theorem 3.1 (Construction of welfare-optimal learning equilibrium). Consider a learning game with perfect policy monitoring. Fix a welfare function $w$, cooperative policy optimizer $O^C$, and punishment policy optimizer $O^p$. Consider the corresponding learning algorithms $(\sigma_1, \sigma_2)$ constructed as in Algorithm 3.

Let $t^b = \sum_{b' \le b}(M^{b'} + N^{b'})$, i.e., $t^b$ is the time at which the $b$th block starts. Write $V^{t:t'}(H^t, \theta) = (t' - t)^{-1}\, \mathbb{E}_\theta\left( \sum_{v=t}^{t'} R^v \mid H^t \right)$. Use a similar notation for the average value attained over time points $t, \ldots, t'$ under combinations of stationary policies $\theta_i$ and learning algorithms $\sigma_i$. For ease of notation, define the following quantities:
$$\begin{aligned}
\hat V_2^{b,C} &= V^{t^b + M^b : t^b + M^b + N^b}(H^{t^b}, \hat\theta_1^{b,C}, \hat\theta_2^{b,C}); \\
V_2^{b,C} &= V^{t^b + M^b : t^b + M^b + N^b}(H^{t^b}, \theta_1^{C}, \theta_2^{C}); \\
\hat V_2^{b,p} &= V^{t^b + M^b : t^b + M^b + N^b}(H^{t^b}, \hat\theta_1^{b,p}, \sigma_2); \\
V_2^{b,p} &= V^{t^b + M^b : t^b + M^b + N^b}(H^{t^b}, \theta_1^{p}, \theta_2^{p}).
\end{aligned}$$

Make the following assumptions:

1. For some $r_\delta^C \in (0, 1)$, $\hat V_2^{b,C} - V_2^{b,C}$ converges in probability to 0 at a $(t^b)^{r_\delta^C}$ rate;

2. For some $r_\delta^p \in (0, 1)$, $\hat V_2^{b,p} - V_2^{b,p}$ converges in probability to 0 at a $(t^b)^{r_\delta^p}$ rate;

3. For a Markov chain indexed by policy profile with parameter $\theta$, let $P_\theta$ be its transition operator and $\nu_\theta$ its stationary distribution. Then the stochastic game is uniformly ergodic (e.g. Ortner 2018) in the sense that, for any measure over states $\nu$,
$$\sup_\theta \left\| P_\theta^t \nu - \nu_\theta \right\| \le K \gamma^t$$
for constants $K$ and $\gamma$;

4. For any measure $\nu$ and policy profile parameter $\theta$, $\nu$ and $\nu_\theta$ are dominated by a measure $\mu$, and the reward function satisfies
$$\int \left| r_2\{S^t, \pi_\theta(S^t)\} \right| d\mu(S^t) \le |R_2|_\mu < \infty;$$

5. The reward function is bounded, such that the most by which player 2’s average payoff in a given interval can exceed $V_2^C$ is bounded by $\Delta_2 < \infty$;

6. $\sum_{b \le B} M^b \left\{ \sum_{b \le B} \left( M^b + N^b \right) \right\}^{-1} \longrightarrow 0$.

Under Assumptions 1–6, $(\sigma_1, \sigma_2)$ is a learning equilibrium and is consistent for the bargaining solution.

Proof. In Appendix A.


It is beyond the scope of this article to analyze the conditions for the assumptions of Theorem 3.1 to hold. But I note that the first assumption is a standard desideratum for learning algorithms in single-agent reinforcement learning. The second assumption is a standard desideratum in the case of zero-sum games, which can be used to construct a punishment policy by setting the punishment policy equal to the one which minimizes the other player’s best-case payoff. For instance, Yang, Xie, and Wang (2019) find that deep Q-network (DQN; Mnih et al. 2013) estimators converge to the optimal Q-function at a $\max_j (t^b)^{\alpha_j^* / (2\alpha_j^* + \ell_j)}$ rate, where the optimal Q-function can be written as a composition of sparse functions which are $\alpha_j^*$-Hölder smooth and depend on $\ell_j$ of their inputs. This result extends to a DQN variant of minimax Q-learning (Littman, 1994), which can be used to construct a punishment policy.

Also note that several of the welfare functions in Table 1 straightforwardly inherit useful properties with respect to single-agent policy-learning methods. For instance, the utilitarian welfare function admits the standard Bellman equations, and because it is simply the sum of the individuals’ value functions, it has an unbiased gradient estimator whenever the individual value functions have an unbiased gradient estimator. As a sum of monotone transformations of the value functions, the Nash welfare inherits improvement guarantees — for instance, those which form the theoretical background to trust region policy optimization (Schulman et al., 2015). Regarding the construction of punishment policies, reinforcement learning for zero-sum games is relatively well understood in comparison with general-sum games. Several algorithms with theoretical convergence guarantees exist for policy-learning in extensive form games (Srinivasan et al., 2018; Lockhart et al., 2019), and model-free deep reinforcement learning has seen high-profile successes in complex zero-sum games (Vinyals et al., 2019; Schrittwieser et al., 2019).

Lastly, uniform ergodicity (Assumption 3) is a standard assumption in the theoretical analysis of reinforcement learning; see e.g. Ortner (2018) and references therein. Of course, this may be a highly unrealistic assumption. See Appendix B for discussion of an “ex interim” variant of the learning equilibrium solution concept, which drops the ergodicity assumption, making it applicable to a much wider class of environments (but losing guarantees on the payoffs generated by the learning algorithms).


3.2 Illustration: Iterated Prisoner’s Dilemma

I now describe simulations illustrating the enforcement of steps towards the bargaining solution, using the iterated Prisoner’s Dilemma (Table 3). We treat this iterated game as a stochastic game in which the state at time $t$ is given by the actions taken at the last time step, i.e., $S^t = (A_1^{t-1}, A_2^{t-1})$. Player 1’s policy $\pi_{\theta_1}$ gives the probability of each action given player 2’s action at the previous timestep. In particular, $\theta_i = (\theta_{i,11}, \theta_{i,12}, \theta_{i,21}, \theta_{i,22})$, and
$$P\{\pi_{\theta_1}(S^t) = a\} = P\{\pi_{\theta_1}(A_2^{t-1}) = a\} \propto \exp\left(\theta_{1, A_2^{t-1} a}\right).$$
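As a concrete reference for this parameterization, here is a minimal Python sketch of the memory-1 softmax policy; the array layout and action coding are assumptions of the sketch.

```python
import numpy as np

C, D = 0, 1   # action indices for Cooperate and Defect

def action_probs(theta_i, prev_opponent_action):
    """Memory-1 softmax policy: theta_i is a (2, 2) array of logits, with the
    row indexed by the opponent's previous action and the column by the next
    action, mirroring theta_{i, A^{t-1}_{-i} a} above."""
    logits = theta_i[prev_opponent_action]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_action(theta_i, prev_opponent_action, rng):
    return rng.choice([C, D], p=action_probs(theta_i, prev_opponent_action))

rng = np.random.default_rng(0)
theta_1 = np.zeros((2, 2))        # uniform random policy to start
a1 = sample_action(theta_1, prev_opponent_action=D, rng=rng)
```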

We write the (mean) reward function as $r_i(a_1, a_2)$, dropping the state argument because rewards are independent of state here. We introduce stochasticity by drawing rewards $R_1^t \sim N\{r(A_1^t, A_2^t), 0.1\}$.

For the base cooperative policy optimization algorithm $O^C$ we use an off-policy variant of Williams (1992)’s REINFORCE. Let $\theta^{C,t}$ be the parameter returned by taking a policy gradient step on the welfare function at time $t$. I use the utilitarian welfare function $w^{\text{Util}}$ and the Nash welfare $w^{\text{Nash}}$ (Table 1). Define importance sampling weights
$$\rho^v(\theta) = \frac{P\{\pi_{\theta_1}(S^v) = A_1^v\}\, P\{\pi_{\theta_2}(S^v) = A_2^v\}}{P\{\pi_{\theta_1^{C,v}}(S^v) = A_1^v\}\, P\{\pi_{\theta_2^{C,v}}(S^v) = A_2^v\}}.$$
For a window size $s$ ($s = 20$ in my experiments), define normalized weights
$$\tilde\rho^v(\theta) = \frac{\rho^v(\theta)}{\sum_{u=t-s}^{t} \rho^u(\theta)}$$
and estimated welfare functions
$$\hat w^{\text{Util},t}(\theta) = \sum_{v=t-s}^{t} \tilde\rho^v(\theta)\,(R_1^v + R_2^v),$$
$$\hat w^{\text{Nash},t}(\theta) = \log\left\{ \sum_{v=t-s}^{t} \tilde\rho^v(\theta) R_1^v - \min_{a_1}\max_{a_2} r_2(a_1, a_2) \right\} + \log\left\{ \sum_{v=t-s}^{t} \tilde\rho^v(\theta) R_2^v - \min_{a_2}\max_{a_1} r_1(a_1, a_2) \right\}.$$
Then, for $\hat w^t \in \{\hat w^{\text{Util},t}, \hat w^{\text{Nash},t}\}$,
$$\theta^{C,t+1} = O^C(H^t) = \theta^{C,t} + \nabla_\theta \hat w^t(\theta)\big|_{\theta = \theta^{C,t}}.$$
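A small Python sketch of these importance-weighted welfare estimates over a window; the array-based interface is an assumption of the sketch, and in practice the gradient step above would be taken by differentiating such an estimate with respect to $\theta$, e.g. with an automatic-differentiation library.

```python
import numpy as np

def estimated_utilitarian_welfare(target_probs, behavior_probs, r1, r2):
    """Windowed off-policy estimate of the utilitarian welfare.
    target_probs[v]   = P{pi_theta1(S^v)=A_1^v} * P{pi_theta2(S^v)=A_2^v}
    behavior_probs[v] = the same product under the cooperative parameters theta^{C,v}
    r1, r2            = observed rewards over the window."""
    rho = np.asarray(target_probs) / np.asarray(behavior_probs)
    rho_tilde = rho / rho.sum()                      # normalized importance weights
    return float(np.sum(rho_tilde * (np.asarray(r1) + np.asarray(r2))))

def estimated_nash_welfare(target_probs, behavior_probs, r1, r2, d1, d2):
    """Same construction for the (log) Nash welfare, with d1, d2 the minimax
    disagreement values computed from the estimated reward functions.
    Assumes the weighted returns exceed d1 and d2 (not handled otherwise)."""
    rho = np.asarray(target_probs) / np.asarray(behavior_probs)
    rho_tilde = rho / rho.sum()
    return float(np.log(np.sum(rho_tilde * np.asarray(r1)) - d1)
                 + np.log(np.sum(rho_tilde * np.asarray(r2)) - d2))
```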


Perfect policy monitoring

We first consider the perfect policy monitoring (PM) setting. I consider a tit-for-tat (TFT) policy in which deviations from the optimization schedule are punished for a single time period using the estimated minimax reward strategy. The estimated reward function $\hat r_2^t$ is simply given by the sample mean rewards at observed pairs of actions until time $t$. The actions taken by a player following the resulting learning algorithm, which I call $\sigma^{\text{TFT-PM}}$, are given in display 5.
$$A_1^t \sim \begin{cases} \pi_{\theta_1^{C,t}}(S^t), & \text{if } \hat w(\theta_1^{C,t}, \theta_2^t) - \hat w(\theta^{C,t}) < 0.01; \\ \arg\min_{a_1}\max_{a_2} \hat r_2^t(a_1, a_2), & \text{otherwise.} \end{cases} \qquad (5)$$
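A literal-minded Python sketch of this decision rule; the helper callables and the way the welfare comparison is coded are assumptions of the sketch.

```python
import numpy as np

def tft_pm_action(theta1_coop, theta2_coop, theta2_observed, state,
                  welfare_hat, sample_coop_action, r2_hat, tol=0.01):
    """Sketch of display 5: follow the cooperative schedule if the counterpart's
    observed parameters are within `tol` of the cooperative ones in estimated-
    welfare terms, otherwise play the estimated minimax action against player 2.
    r2_hat is a (|A1|, |A2|) array of sample-mean rewards for player 2."""
    if welfare_hat(theta1_coop, theta2_observed) - welfare_hat(theta1_coop, theta2_coop) < tol:
        return sample_coop_action(theta1_coop, state)
    # Punish: the action a1 minimizing player 2's best-case estimated reward.
    return int(np.argmin(np.max(r2_hat, axis=1)))
```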

Figure 2 (left panel: TFT-PM($w^{\text{Util}}$) self-play; right panel: TFT-PM($w^{\text{Nash}}$) self-play): Performance of different learning algorithms in the iterated Prisoner’s Dilemma. The prob axis is the probability of mutual cooperation (CC) and mutual defection (DD) under the policy profile at the corresponding time step, and the payoffs axis is the reward for each player at the corresponding time steps.

Figure 2 summarizes the performance of $\sigma^{\text{TFT-PM}}$ played against itself for the variant of the IPD described above. As expected, $\sigma^{\text{TFT-PM}}$ converges to mutual cooperation.

Imperfect policy monitoring

I also ran some exploratory experiments for the imperfect policy monitoring setting. At each step, each player compares the likelihood of their counterpart’s actions under the cooperative sequence of parameter updates to the maximum likelihood stationary policy. This stationary policy estimate represents the hypothesis that the counterpart is not following the cooperative learning algorithm. If the likelihood of the maximum likelihood stationary policy is sufficiently higher than that of the cooperative sequence of updates, the estimated minimax action is played. I construct a (not fully rigorous) test which assumes that the average log likelihoods under each hypothesis are asymptotically normally distributed, and reject the hypothesis that one’s counterpart is cooperating when the difference of these log likelihoods (suitably normalized) is large in the positive direction. See Appendix C for details.

The actions of a player following $\sigma^{\text{TFT-NoPM}}(H^t)$ are given by:
$$A_1^t \sim \begin{cases} \pi_{\theta_1^{C,t}}(S^t), & \text{if the hypothesis that the counterpart is cooperating is not rejected;} \\ \arg\min_{a_1}\max_{a_2} \hat r_2^t(a_1, a_2), & \text{otherwise.} \end{cases}$$

Figure 3 shows the results of $\sigma^{\text{TFT-NoPM}}$ in self-play. This learning algorithm profile is still able to learn the mutual-cooperation policy profile, and the average payoffs are close to those gained when both agents cooperate at each time step (i.e., $-1$). But the payoffs remain sub-optimal due to occasional false detections of defection (and consequent punishments). Figure 3 also shows the results of $\sigma^{\text{TFT-NoPM}}$ against a naive policy gradient learner. Each player’s average reward converges to approximately $-2$, the payoff at mutual defection. So $\sigma^{\text{TFT-NoPM}}$ is able to correctly detect defection and avoid exploitation in this case.

4 Discussion

I have sketched a framework for understanding individual rationality and cooperation on the part of the principals of reinforcement learners. It is of paramount importance that powerful machine intelligences avoid conflict when they are eventually deployed. One way to avoid such outcomes is to construct profiles of strategies which rational actors prefer to profiles that lead to conflict. I have argued that the appropriate notion of rationality in this context is learning equilibrium (rather than mere convergence to Nash equilibrium, for instance). I have taken initial steps towards showing how learning equilibria which are cooperative — in the sense of optimizing a reasonable welfare function — can be constructed.


Figure 3 (left panel: TFT-NoPM($w^{\text{Util}}$) self-play; right panel: TFT-NoPM($w^{\text{Util}}$) (player 1) against Naive (player 2)): Performance of $\sigma^{\text{TFT-NoPM}}$ against itself (left) and against a naive REINFORCE learner (right) in the iterated Prisoner’s Dilemma. The prob axis is the probability of mutual cooperation (CC) and mutual defection (DD) under the policy profile at the corresponding time step, and the payoffs axis is the reward for each player at the corresponding time steps. The cumulative_payoffs axis is the average reward accumulated until the corresponding timestep. These values are averaged over 20 replicates.

Many open questions remain. First, while I have relied on a welfare function, I have said little about which welfare function should be used. As alluded to in point 2 in the introduction, the ideal scenario would be for the principals of the reinforcement learning systems in question to coordinate on a welfare function before deploying these systems. Another direction worth exploring is the design of learning algorithms which detect whether their counterparts are optimizing a welfare function other than their own, and adjust accordingly to optimize some compromise welfare function.

The framework I present is limited in many ways. I am currently working to construct learning equilibria in the imperfect policy monitoring case using testing procedures like that described in Section 3.2 and Appendix C. Beyond that, the theory should be extended to the case of partial state observability, more than two agents, and criteria other than average reward. I have also assumed that the agents’ reward functions are known, but a complete framework should address the problem of incentives to misrepresent one’s reward function to improve one’s payoffs in the bargaining solution. This raises questions of mechanism design; see Nisan and others (2007, Ch. 9) for an introduction aimed at computer scientists. Note that the use of mechanisms in which principals truthfully reveal their reward functions may greatly restrict the set of available welfare functions. It is well-known, for instance, that the Vickrey-Clarke-Groves family of mechanisms — which implement the utilitarian welfare function — is the only family of mechanisms which are both efficient and truthful (Green and Laffont, 1977; Holmström, 1979).

In terms of implementation, it may be necessary to develop novel reinforcement learning methods tailored for the optimization of different welfare functions. For instance, I have focused on policy-based reinforcement learning, but the development of value-based methods would likely broaden the space of practical algorithms available for implementing my approach. This is nontrivial, as the welfare functions in Table 1 other than the utilitarian welfare do not straightforwardly admit a Bellman equation. The question of computational feasibility may also interact in fruitful ways with the problem of selecting a welfare function. Computational considerations may eliminate welfare functions, such as Kalai-Smorodinsky, which are much harder to optimize than others.

Among the most serious limitations of the framework presented here are the ergodicity assumptions which I used to construct equilibria. Ergodicity is not a realistic assumption about the real world. A more realistic framework might involve “ex-interim”³ learning equilibrium concepts, such as that sketched in Appendix B.

³ Cf. the distinction between ex-interim and ex-post equilibria in Bayesian games.

Acknowledgements

Many thanks to Tobias Baumann, Daniel Kokotajlo, Johannes Treutlein, and anonymous reviewers for helpful comments on drafts of this document.

References

Athreya, K. B., and Lahiri, S. N. 2006. Measure theory and probability theory. Springer Science & Business Media.

Balduzzi, D.; Racaniere, S.; Martens, J.; Foerster, J.; Tuyls, K.; and Graepel, T. 2018. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642.

Binmore, K.; Rubinstein, A.; and Wolinsky, A. 1986. The Nash bargaining solution in economic modelling. The RAND Journal of Economics 176–188.

Bowling, M., and Veloso, M. 2001. Rational and convergent learning in stochastic games. In International Joint Conference on Artificial Intelligence, volume 17, 1021–1026. Lawrence Erlbaum Associates Ltd.

Brafman, R. I., and Tennenholtz, M. 2003. Efficient learning equilibrium. In Advances in Neural Information Processing Systems, 1635–1642.

Colman, A. M.; Pulford, B. D.; and Lawrence, C. L. 2014. Explaining strategic coordination: Cognitive hierarchy theory, strong Stackelberg reasoning, and team reasoning. Decision 1(1):35.

Conitzer, V., and Sandholm, T. 2007. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning 67(1-2):23–43.

De Cote, E. M., and Littman, M. L. 2012. A polynomial-time Nash equilibrium algorithm for repeated stochastic games. arXiv preprint arXiv:1206.3277.

Felsenthal, D. S., and Diskin, A. 1982. The bargaining problem revisited: minimum utility point, restricted monotonicity axiom, and the mean as an estimate of expected utility. Journal of Conflict Resolution 26(4):664–691.

Foerster, J.; Chen, R. Y.; Al-Shedivat, M.; Whiteson, S.; Abbeel, P.; and Mordatch, I. 2018. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 122–130. International Foundation for Autonomous Agents and Multiagent Systems.

Fudenberg, D., and Maskin, E. 1986. The folk theorem in repeated games with discounting or with incomplete information. Econometrica 54(3):533–554.

Green, J., and Laffont, J.-J. 1977. Characterization of satisfactory mechanisms for the revelation of preferences for public goods. Econometrica: Journal of the Econometric Society 427–438.

Harsanyi, J. C. 1955. Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility. Journal of Political Economy 63(4):309–321.

Hernandez-Leal, P.; Kartal, B.; and Taylor, M. E. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33(6):750–797.

Holmström, B. 1979. Groves’ scheme on restricted domains. Econometrica: Journal of the Econometric Society 1137–1144.

Kakade, S. M. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, 1531–1538.

Kalai, E., and Smorodinsky, M. 1975. Other solutions to Nash’s bargaining problem. Econometrica 43(3):513–518.

Kalai, E. 1977. Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society 1623–1630.

Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; and Graepel, T. 2017. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 4190–4203.

Lerer, A., and Peysakhovich, A. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068.

Letcher, A.; Foerster, J.; Balduzzi, D.; Rocktäschel, T.; and Whiteson, S. 2018. Stable opponent shaping in differentiable games. arXiv preprint arXiv:1811.08469.

Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier. 157–163.

Lockhart, E.; Lanctot, M.; Pérolat, J.; Lespiau, J.-B.; Morrill, D.; Timbers, F.; and Tuyls, K. 2019. Computing approximate equilibria in sequential adversarial games by exploitability descent. arXiv preprint arXiv:1903.05614.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Nash, J. F. 1950. The bargaining problem. Econometrica: Journal of the Econometric Society 155–162.

Nisan, N., et al. 2007. Introduction to mechanism design (for computer scientists). Algorithmic Game Theory 9:209–242.

Ortner, R. 2018. Regret bounds for reinforcement learning via Markov chain concentration. arXiv preprint arXiv:1808.01813.

Peysakhovich, A., and Lerer, A. 2017. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975.

Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. 2019. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889–1897.

Shoham, Y., and Leyton-Brown, K. 2008. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.

Srinivasan, S.; Lanctot, M.; Zambaldi, V.; Pérolat, J.; Tuyls, K.; Munos, R.; and Bowling, M. 2018. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems, 3422–3435.

Van Damme, E. 1986. The Nash bargaining solution is optimal. Journal of Economic Theory 38(1):78–100.

Vinyals, O.; Babuschkin, I.; Chung, J.; Mathieu, M.; Jaderberg, M.; Czarnecki, W. M.; Dudzik, A.; Huang, A.; Georgiev, P.; Powell, R.; et al. 2019. AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind Blog.

Wang, W.; Hao, J.; Wang, Y.; and Taylor, M. 2018. Towards cooperation in sequential prisoner’s dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.

Wiseman, T. 2012. A partial folk theorem for games with private learning. Theoretical Economics 7(2):217–239.

Yang, Z.; Xie, Y.; and Wang, Z. 2019. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137.

Zhang, C., and Lesser, V. 2010. Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Zhou, L. 1997. The Nash bargaining theory with non-convex problems. Econometrica: Journal of the Econometric Society 681–685.

Appendix A: Proof of Theorem 3.1

Proof. For brevity, write $V_2(\theta^C) = V_2^C$ and $V_2(\theta^p) = V_2^p$. Let $\chi^b$ indicate whether player 2 cooperated in block $b$ (setting $\chi^{-1} = 1$).

Notice that, under Assumption 5, we can bound the difference between $V_2^C$ and the finite-time average reward under the cooperative policy in block $b$ as follows:
$$V_2^{t^b : t^b + M^b + N^b}(H^{t^b}, \sigma_1^C, \sigma_2^C) - V_2^C \le (M^b + N^b)^{-1}\left\{ M^b \Delta_2 + N^b \left( \hat V_2^{b,C} - V_2^C \right) \right\}.$$

And by Assumptions 1, 3, and 4, there exist constants $K^C$, $r_b^C$, and $r_\epsilon^C < r_\delta^C$ such that
$$\begin{aligned}
\hat V_2^{b,C} - V_2^C &= \left( \hat V_2^{b,C} - V_2^{b,C} \right) + \left( V_2^{b,C} - V_2^C \right) \\
&\le (t^b)^{-r_\epsilon^C}\, P\!\left\{ \hat V_2^{b,C} - V_2^{b,C} < (t^b)^{-r_\epsilon^C} \right\} + \Delta_2\, P\!\left\{ \hat V_2^{b,C} - V_2^{b,C} > (t^b)^{-r_\epsilon^C} \right\} \\
&\quad + (N^b)^{-1} \sum_{t = t^b + M^b}^{t^b + M^b + N^b} \int r_2\!\left\{ S^t, \pi_{\theta^{b,C}}(S^t) \right\} \left\{ dP_{\theta^{b,C}}^t \nu(S^{t^b + M^b})(S^t) - d\nu_\theta(S^t) \right\} \\
&\le (t^b)^{-r_\epsilon^C} \left\{ 1 - (t^b)^{-r_b^C (r_\delta^C - r_\epsilon^C)} K^C \right\} + \Delta_2 (t^b)^{-r_b^C (r_\delta^C - r_\epsilon^C)} K^C + K |R_2|_\mu \left\{ N^b (1 - \gamma) \right\}^{-1} \left( 1 - \gamma^{N^b} \right).
\end{aligned}$$

By Assumptions 2, 3, and 4, the same holds for the corresponding punishment point quantities, making the necessary changes. Moreover, by Assumption 6,
$$(t^B)^{-1} \sum_{b \le B} \left( 1 - \gamma^{N^b} \right) \le (t^B)^{-1} B \longrightarrow 0.$$

We will now write the value attained by player 2 in the first $B$ blocks in a way that allows us to apply these convergence statements. Then we will take expectations and apply Assumption 3 to show that each term goes to 0. Letting $\Omega$ be a set of measure 1 and $\chi^b(\omega)$ the value of $\chi^b$ along an arbitrary sample path $\omega \in \Omega$, write $\bar B^D = \sup_{\omega \in \Omega} \limsup_{B \to \infty} B^{-1} \sum_{b \le B} \left\{ 1 - \chi^b(\omega) \right\}$. And, write
$$\mathbb{1}^{b,C}_< = \mathbb{1}\left\{ \hat V_2^{b,C} < V_2^{b,C} + (t^b)^{-r_\epsilon^C} \right\}; \qquad \mathbb{1}^{b,p}_< = \mathbb{1}\left\{ \hat V_2^{b,p} < V_2^{b,p} + (t^b)^{-r_\epsilon^p} \right\}.$$

Then,
$$\begin{aligned}
V_2^{t^B} - V_2^C &= (t^B)^{-1} \sum_{b=0}^{B-1} (t^{b+1} - t^b)\, \mathbb{E}\!\left( V_2^b - V_2^C \right) \\
&\le (t^B)^{-1} \sum_{b=0}^{B-1} \left[ M^b \Delta_2 + N^b\, \mathbb{E}\!\left\{ \chi^{b-1} \left( \hat V_2^{b,C} - V_2^C \right) + (1 - \chi^{b-1}) \left( \hat V_2^{b,p} - V_2^C \right) \right\} \right] \\
&= (t^B)^{-1} \sum_{b=0}^{B-1} \left[ M^b \Delta_2 + N^b\, \mathbb{E}\!\left\{ \chi^{b-1} \left( \hat V_2^{b,C} - V_2^C \right) + (1 - \chi^{b-1}) \left( \hat V_2^{b,p} - V_2^p + V_2^p - V_2^C \right) \right\} \right] \\
&= (t^B)^{-1} \sum_{b=0}^{B-1} \left[ M^b \Delta_2 + N^b\, \mathbb{E}\!\left\{ \chi^{b-1} \left( \hat V_2^{b,C} - V_2^C \right) + (1 - \chi^{b-1}) \left( \hat V_2^{b,p} - V_2^p \right) \right\} \right] + \bar B^D \left( V_2^p - V_2^C \right) \\
&= (t^B)^{-1} \sum_{b=0}^{B-1} M^b \Delta_2 + (t^B)^{-1} \sum_{b=0}^{B-1} N^b\, \mathbb{E}\!\left\{ \chi^{b-1} \left( \hat V_2^{b,C} - V_2^C \right) + (1 - \chi^{b-1}) \left( \hat V_2^{b,p} - V_2^p \right) \right\} + \bar B^D \left( V_2^p - V_2^C \right) \\
&\le (t^B)^{-1} \sum_{b=0}^{B-1} M^b \Delta_2 + (t^B)^{-1} \sum_{b=0}^{B-1} N^b\, \mathbb{E}\!\left[ \left\{ (t^b)^{-r_\epsilon^C} + \rho(N^b) \right\} \mathbb{1}^{b,C}_< + \Delta_2 \left( 1 - \mathbb{1}^{b,C}_< \right) + \left\{ (t^b)^{-r_\epsilon^p} + \rho(N^b) \right\} \mathbb{1}^{b,p}_< + \Delta_2 \left( 1 - \mathbb{1}^{b,p}_< \right) \right] + \bar B^D \left( V_2^p - V_2^C \right). \qquad (6)
\end{aligned}$$


Now, the quantity on the right hand side of display 6 is bounded above by
$$\begin{aligned}
&(t^B)^{-1} \sum_{b=0}^{B-1} M^b \Delta_2 \\
&\quad + (t^B)^{-1} \sum_{b=0}^{B-1} N^b \left( \left[ \left\{ (t^b)^{-r_\epsilon^C} + K \gamma^{N^b} \right\} \left\{ 1 - (t^b)^{-r_\delta^C} \right\} + \Delta_2 K^C (t^b)^{-r_b^C (r_\delta^C - r_\epsilon^C)} \right] \right. \\
&\qquad\qquad\qquad\quad \left. + \left[ \left\{ (t^b)^{-r_\epsilon^p} + K \gamma^{N^b} \right\} \left\{ 1 - (t^b)^{-r_\delta^p} \right\} + \Delta_2 K^p (t^b)^{-r_b^p (r_\delta^p - r_\epsilon^p)} \right] \right) \\
&\quad + \bar B^D \left( V_2^p - V_2^C \right) \\
&\longrightarrow \bar B^D \left( V_2^p - V_2^C \right).
\end{aligned}$$
Thus, $V_2(H^t, \sigma_1, \sigma_2) - V_2^C \le 0$, with strict inequality if $\bar B^D > 0$.

Moreover, Assumption 1 implies that $V_2(H^0, \sigma_1, \sigma_2) = V_2^C$.

Appendix B: Ex interim learning equilibrium

The ergodicity assumption I used to construct learning equilibria is not realistic for the kinds of environments generally intelligent agents might be deployed in. Here I propose a solution concept which is applicable in general environments. The solution concept applies to agents for whom Bayesian planning is intractable. However, we lose the ex post guarantees on the agents’ long-run payoffs.

In (model-based) ex interim learning equilibrium, the agents’ decisions at time $t$ are obtained by planning against an estimated model $\hat M^t$. The cooperative policy at each time step is the one that maximizes the estimate of the welfare function obtained using the estimated model $\hat M^t$. Then, the cooperative action is just the action recommended by this cooperative policy. If an agent defects at time $t$, they are punished according to their minimax value under estimated model $\hat M^t$.

I do not specify conditions on the estimator $\mathcal{M}$. The idea is that, in practice, the principals must agree on what a reasonable belief-forming process looks like before deploying their agents.⁴ One natural requirement is that the estimator be consistent, i.e., that it converge to the true environment dynamics as data accumulates. However, as I have not specified the class of generative models, even consistency might be too strong a requirement. Nevertheless, the hope is that this framework will at least allow us to say:

⁴ While this might seem unsatisfying, the need for agreement on other parameters (the payoffs, the common prior, the transition dynamics of the stochastic game, equilibrium selection principle, etc.) has always been present in game theory.

1. If the estimator is consistent, then an ex interim learning equilibrium is indeed an ex post learning equilibrium — that is, it is a learning equilibrium with respect to the principals’ true payoffs under the submitted learning algorithms;

2. Even if the model is misspecified, agents playing an ex interim learning equilibrium are at least playing efficiently with respect to their “best guess” about the environment.

Define $w(H^t, \theta; M)$ to be the welfare under the policy profile indexed by $\theta$ starting at history $H^t$ when $M$ is the true generative model. Similarly define value functions $V_i(H^t, \theta; M)$. Then here is the formal setup:

Definition 4.1 (Ex interim learning equilibrium). Let $\mathcal{M}$ be a function from histories to dynamics models (for instance, models of the transition dynamics of a stochastic game). Let $\hat M^t = \mathcal{M}(H^t)$ be an estimator of the dynamics at time $t$. Define the cooperative policy profile and corresponding action profile at time $t$ as
$$\theta_i^{C,t} = \arg\max_\theta w(H^t, \theta; \hat M^t), \qquad A_i^{C,t} = \pi_{\theta_i^{C,t}}(H^t).^5$$
Define the punishment policy at history $H^t$ as
$$\theta_i^{p,t} = \arg\min_{\theta_1} \max_{\theta_2} V_2(H^t, \theta_1, \theta_2; \hat M^t). \qquad (7)$$
Player 1’s punishment policy is activated at time $t$ if $A_2^t \ne A_2^{C,t}$. Call $(\sigma_1, \sigma_2)$ an ex interim learning equilibrium if, for each history $H^t$ and $i = 1, 2$,
$$\sigma_i(H^t) = \begin{cases} A_i^{C,t}, & \text{if the punishment policy is not active;} \\ \pi_{\theta_i^{p,t}}(H^t), & \text{otherwise.} \end{cases}$$

⁵ For simplicity I assume deterministic policies (or stochastic policies in cases where the random seed is shared between agents).


What justifies calling $(\sigma_1, \sigma_2)$ as defined above an equilibrium? At each time step, player $i$ does not expect to benefit by choosing an action other than that recommended by $\sigma_i$, according to the common model estimate $\hat M^t$ — thus the “ex interim”. So, if the principals were willing to agree at the outset that $\mathcal{M}$ is a reasonable way for their agent to form beliefs about the world, then it seems natural to make playing such an equilibrium a requirement of individual rationality. Still, there are obvious complications: for instance, what happens when the agents also have private signals which they can use to update their beliefs?

Lastly, as I’ve said before, I do not actually want to advocate for the use of a minimax punishment strategy (display 7). Another important direction is working out more forgiving strategies which still adequately discourage exploitation.

Appendix C: Test for cooperativeness under imperfect policy monitoring

Conditional on parameters $\theta_i$ and state $S^v$, the random variable $\log P\{\pi_{\theta_i}(S^v) = A_i^v\}$ has variance
$$\lambda^v(\theta_i) = \left[ P\{\pi_{\theta_i}(S^v) = 1\} - P\{\pi_{\theta_i}(S^v) = 1\}^2 \right] \times \left( \log\left[ \frac{P\{\pi_{\theta_i}(S^v) = 1\}}{1 - P\{\pi_{\theta_i}(S^v) = 1\}} \right] \right)^{2}.$$

Let $\theta_2^{D,t} = \arg\max_{\theta_2} \prod_{v=1}^{t} P\{\pi_{\theta_2}(S^v) = A_2^v\}$ be the maximum likelihood parameter under the hypothesis that player 2 is playing a stationary policy. We will use the limiting distribution of suitably normalized log likelihoods to construct our test. To derive a limiting distribution, define the sums of variances and average log likelihoods (conditional on the sequence of states in $H^t$):
$$\begin{aligned}
\Lambda_2^{C,t} &= \sum_{v=1}^{t} \lambda^v(\theta_2^{C,v}); & \Lambda_2^{D,t} &= \sum_{v=1}^{t} \lambda^v(\theta_2^{D,t}); \\
\ell_2^{C,t} &= \left( \Lambda_2^{C,t} \right)^{-1} \left[ \sum_{v=1}^{t} \log P\{\pi_{\theta_2^{C,v}}(S^v) = A_2^v\} \right]; & \ell_2^{D,t} &= \left( \Lambda_2^{D,t} \right)^{-1} \left[ \sum_{v=1}^{t} \log P\{\pi_{\theta_2^{D,t}}(S^v) = A_2^v\} \right].
\end{aligned}$$
Write $\bar\ell_2^{C,t} = \mathrm{plim}_{t\to\infty}\, \ell_2^{C,t}$ and $\bar\ell_2^{D,t} = \mathrm{plim}_{t\to\infty}\, \ell_2^{D,t}$, when these limits exist. As sums of independent random variables divided by the sums of their variances, $\ell_2^{C,t}$ and $\ell_2^{D,t}$ will (under some additional conditions (Athreya and Lahiri, 2006, Ch. 11)) be asymptotically distributed as $N\{\bar\ell_2^{C,t}, (\Lambda_2^{C,t})^{-1}\}$ and $N\{\bar\ell_2^{D,t}, (\Lambda_2^{D,t})^{-1}\}$, respectively. Thus, supposing that
$$\left( \Lambda_2^{D,t} \right)^{\frac12} \ell_2^{D,t} - \left( \Lambda_2^{C,t} \right)^{\frac12} \ell_2^{C,t} \sim N\!\left( \bar\ell_2^{D,t} - \bar\ell_2^{C,t},\; 2 \right),$$
we can test the hypothesis $\bar\ell_2^{D,t} \le \bar\ell_2^{C,t}$. In the experiments here, player 1 rejects this hypothesis (and punishes accordingly) when
$$1 - \Phi\!\left[ 2^{-\frac12} \left\{ \left( \Lambda_2^{D,t} \right)^{\frac12} \ell_2^{D,t} - \left( \Lambda_2^{C,t} \right)^{\frac12} \ell_2^{C,t} \right\} \right] < 10^{-3},$$
where $\Phi$ is the cumulative distribution function of a standard normal random variable.
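A minimal Python sketch of this rejection rule, assuming the per-step log likelihoods and variances $\lambda^v$ under each hypothesis have already been computed (the array names and function signature are illustrative):

```python
import numpy as np
from scipy.stats import norm

def reject_cooperation(loglik_coop, loglik_stat, var_coop, var_stat, threshold=1e-3):
    """Sketch of the test above. loglik_* are arrays of per-step log likelihoods
    of the counterpart's observed actions under (i) the cooperative update
    schedule and (ii) the maximum-likelihood stationary policy; var_* are the
    corresponding per-step variances lambda^v. Returns True when the hypothesis
    that the counterpart is cooperating is rejected."""
    lam_c, lam_d = np.sum(var_coop), np.sum(var_stat)
    ell_c = np.sum(loglik_coop) / lam_c
    ell_d = np.sum(loglik_stat) / lam_d
    z = (np.sqrt(lam_d) * ell_d - np.sqrt(lam_c) * ell_c) / np.sqrt(2.0)
    return (1.0 - norm.cdf(z)) < threshold
```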


Algorithm 1: Cooperative learning algorithm for player 1, $\sigma_1^C$

Input: Welfare function estimator $H^t \mapsto \hat w$, cooperative policy learner $O^C$, Markov transition function $P$, history $H^t$, current block index $b$, sub-block sizes $\{M^{b'}, N^{b'}\}_{b' \le b}$

$\chi^b \leftarrow 1$;
/* Estimate cooperative policy */
for $t = \sum_{b'<b}(M^{b'} + N^{b'})$ to $\sum_{b'<b}(M^{b'} + N^{b'}) + M^b$ do
    $\theta_1^t \leftarrow O^C(H^{t-1})$;
    $\theta_2^{C,t} \leftarrow O^C(H^{t-1})$;
    $A_1^t \leftarrow \pi_{\theta_1^t}(S^t)$;
    Observe $A_2^t$;
    if $A_2^t = \pi_{\theta_2^{C,t}}(S^t)$ and $\chi^b = 1$ then
        $\chi^b \leftarrow 1$;
    else
        $\chi^b \leftarrow 0$;
    $S^{t+1} \sim P(\cdot \mid S^t, A_1^t, A_2^t)$;
    $H^t \leftarrow H^{t-1} \cup \{A_1^t, A_2^t, R_1^t, R_2^t, S^{t+1}\}$;
$\theta^{b,C} \leftarrow O^C(H^t)$;
/* Deploy cooperative policy */
for $t = \sum_{b'<b}(M^{b'} + N^{b'}) + M^b$ to $\sum_{b'<b}(M^{b'} + N^{b'}) + M^b + N^b$ do
    $A_1^t \leftarrow \pi_{\theta_1^{b,C}}(S^t)$;
    Observe $A_2^t$;
    if $A_2^t = \pi_{\theta_2^{b,C}}(S^t)$ and $\chi^b = 1$ then
        $\chi^b \leftarrow 1$;
    else
        $\chi^b \leftarrow 0$;
    $S^{t+1} \sim P(\cdot \mid S^t, A_1^t, A_2^t)$;
    $H^t \leftarrow H^{t-1} \cup \{A_1^t, A_2^t, R_1^t, R_2^t, S^{t+1}\}$;
return $H^t, \chi^b$


Algorithm 2: Punishment learning algorithm for player 1, $\sigma_1^p$

Input: Welfare function estimator $H^t \mapsto \hat w$, punishment policy learner $O^p$, Markov transition function $P$, history $H^t$, current block index $b$, sub-block sizes $\{M^{b'}, N^{b'}\}_{b' \le b}$

/* Estimate punishment policy */
for $t = \sum_{b'<b}(M^{b'} + N^{b'})$ to $\sum_{b'<b}(M^{b'} + N^{b'}) + M^b$ do
    $\theta_1^t \leftarrow O^p(H^{t-1})$;
    $A_1^t \leftarrow \pi_{\theta_1^t}(S^t)$;
    Observe $A_2^t$;
    $S^{t+1} \sim P(\cdot \mid S^t, A_1^t, A_2^t)$;
    $H^t \leftarrow H^{t-1} \cup \{A_1^t, A_2^t, R_1^t, R_2^t, S^{t+1}\}$;
$\theta^{b,p} \leftarrow O^p(H^t)$;
/* Deploy punishment policy */
for $t = \sum_{b'<b}(M^{b'} + N^{b'}) + M^b$ to $\sum_{b'<b}(M^{b'} + N^{b'}) + M^b + N^b$ do
    $A_1^t \leftarrow \pi_{\theta_1^{b,p}}(S^t)$;
    Observe $A_2^t$;
    $S^{t+1} \sim P(\cdot \mid S^t, A_1^t, A_2^t)$;
    $H^t \leftarrow H^{t-1} \cup \{A_1^t, A_2^t, R_1^t, R_2^t, S^{t+1}\}$;
return $H^t$

Algorithm 3: Learning equilibrium strategy

Input: Welfare function estimator $H^t \mapsto \hat w$, cooperative policy learner $O^C$, punishment policy learner $O^p$, Markov transition function $P$

Initialize $\theta_1^0$, state $S^0$.
for $b = 0, 1, \ldots$ do
    if $\chi^{b-1} = 1$ then
        $H^{t^b}, \chi^b \leftarrow \sigma_1^C(H^{t^{b-1}})$;
    else if $\chi^{b-1} = 0$ then
        $H^{t^b} \leftarrow \sigma_1^p(H^{t^{b-1}})$;
        $\chi^b \leftarrow 1$;

                 Player 2
                 C          D
Player 1    C    −1, −1     −3, 0
            D    0, −3      −2, −2

Table 3: Prisoner’s Dilemma expected payoffs.
