
Learning Mean-Field Games

Xin Guo 1 Anran Hu 1 Renyuan Xu 1 Junzi Zhang 2

Abstract

In this paper, we consider the problem of simultaneous learning and decision-making in a stochastic game setting with a large population. We formulate this problem as a generalized mean-field game (GMFG). We first analyze the existence of the solution to this GMFG, and show that naively combining Q-learning with the three-step fixed-point approach in classical MFGs yields unstable algorithms. We then propose an alternating approximating Q-learning algorithm with Boltzmann policy (MF-AQ) and establish its convergence property. The numerical performance of this MF-AQ algorithm on a repeated Ad auction problem shows superior computational efficiency when compared with existing algorithms for multi-agent reinforcement learning (MARL).

1. Introduction

Motivating example. This paper is motivated by the following Ad auction problem for an advertiser. An Ad auction is a stochastic game on an ad exchange platform among a large number of players, the advertisers. In between the time a web user requests a page and the time the page is displayed, usually within a millisecond, a Vickrey-type second-price auction is run to incentivize interested advertisers to bid for an Ad slot to display their advertisement. Each advertiser has limited information before each bid: first, her own valuation for a slot depends on an unknown conversion of clicks for the item; second, should she win the bid, she only knows the reward after the user's activities on the website are finished. In addition, she has a budget constraint in this repeated auction.

The question is, how should she bid in this online sequential repeated game when there is a large population of bidders competing on the Ad platform, with unknown distributions of the conversion of clicks and rewards?

1 Industrial Engineering and Operations Research Department, UC Berkeley. 2 Institute for Computational & Mathematical Engineering at Stanford University. Correspondence to: Anran Hu <anran [email protected]>, Xin Guo <[email protected]>, Renyuan Xu <[email protected]>, Junzi Zhang <[email protected]>.

Our work. Motivated by this Ad auction example, we consider the general problem of simultaneous learning and decision-making in a stochastic game setting with a large population. We formulate this type of game, with unknown rewards and dynamics, as a generalized mean-field game (GMFG) that incorporates action distributions. It can also be viewed as a general version of MFGs of McKean-Vlasov type (Acciaio et al., 2018), which is a different paradigm from the classical MFG. It is also beyond the scope of the existing Q-learning framework for Markov decision processes (MDPs) with unknown distributions, as an MDP is technically equivalent to a single-player stochastic game.

On the theory front, we establish, under appropriate technical conditions, the existence and uniqueness of the NE solution to (GMFG). On the computational front, we show that naively combining Q-learning with the three-step fixed-point approach in classical MFGs yields unstable algorithms. We then propose an alternating approximating Q-learning algorithm with Boltzmann policy (MF-AQ) and establish its convergence property. This MF-AQ algorithm is then applied to analyze the Ad auction problem. Compared with the MF-VI algorithm, which assumes known rewards and dynamics, and with existing algorithms for multi-agent reinforcement learning (MARL), MF-AQ demonstrates superior numerical performance in terms of computational efficiency.

Related works. On learning large population games with mean-field approximations, (Yang et al., 2017) focused on inverse reinforcement learning for MFGs without decision making, (Yang et al., 2018) studied an MARL problem with a mean-field approximation term modeling the interaction between one agent and all the other (finitely many) agents, and (Kizilkale & Caines, 2013) and (Yin et al., 2014) considered model-based adaptive learning for MFGs. For learning large population games without mean-field approximation, see (Kapoor, 2018; Hernandez-Leal et al., 2018) and the references therein.

On the specific topic of learning auctions with a large number of advertisers, (Cai et al., 2017) and (Jin et al., 2018) explored reinforcement learning techniques to search for socially optimal solutions with real-world data, and (Iyer et al., 2011) used MFGs to model the auction system with unknown conversion of clicks within a Bayesian framework.

On the subject of reinforcement learning, Q-learning and its variants have been widely applied to various problems with empirical success. In particular, the combination of deep neural networks and Q-learning has been the fundamental building block for some of the most successful human-level AIs (Mnih et al., 2015; Haarnoja et al., 2017). Apart from Q-learning, there are also policy gradient (Sutton et al., 2000) and actor-critic algorithms (Konda & Tsitsiklis, 2000). On the theoretical side, several algorithms have been proposed with near-optimal regret bounds (see (Osband et al., 2013) and (Jaksch et al., 2010)).

However, none of these works formulated the problem of simultaneous learning and decision-making in the MFG framework.

2. Problem setting

2.1. Background: n-player games

Let us first recall the classical n-player game in a discrete-time setting. There are n agents in a game with a finite state space S and a finite action space A, over an infinite time horizon. At each step $t = 0, 1, \dots$, decisions are made: each agent i has her own state $s_t^i \in S \subseteq \mathbb{R}^d$ and needs to take an action $a_t^i \in A \subseteq \mathbb{R}^p$; moreover, given the current state profile $\mathbf{s}_t = (s_t^1, \dots, s_t^n) \in S^n$ and action profile $\mathbf{a}_t = (a_t^1, \dots, a_t^n) \in A^n$, agent i receives a reward $r_t^i \sim R_i(\mathbf{s}_t, \mathbf{a}_t)$ and her state changes to $s_{t+1}^i \sim P_i(\mathbf{s}_t, \mathbf{a}_t)$, where $R_i$ is the reward distribution and $P_i$ is the transition probability for agent i. The admissible policy/control for agent i is of Markovian form $\pi_i : S^n \to \Delta^{|A|}$, which maps each state profile $\mathbf{s} \in S^n$ to a randomized action. Here $\Delta^k := \{(x_1, \dots, x_k) : x_j \ge 0, \sum_{j=1}^{k} x_j = 1\}$ is the probability simplex.

In this infinite-horizon setting, a discount factor $\gamma \in (0, 1)$ is used to define the accumulated reward (a.k.a. the value function) for agent i, given the initial state profile $\mathbf{s}$ and the policy profile $\boldsymbol{\pi} = (\pi_1, \dots, \pi_n)$:

$$V_i(\mathbf{s}, \boldsymbol{\pi}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_i(\mathbf{s}_t, \mathbf{a}_t) \,\Big|\, \mathbf{s}_0 = \mathbf{s}\right], \quad (1)$$

where $i = 1, \dots, n$, $r_i(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}[r \mid r \sim R_i(\mathbf{s}_t, \mathbf{a}_t)]$, $a_t^i \sim \pi_i(\mathbf{s}_t)$, and $s_{t+1}^i \sim P_i(\mathbf{s}_t, \mathbf{a}_t)$. The goal of each agent is to maximize her accumulated reward.

There are different performance criteria for solving an n-player game. The most well-known one is the Nash equilibrium (NE), a policy profile under which no agent can improve her value if the other agents fix their policies.

Definition 2.1 (NE for n-player games). $\boldsymbol{\pi}^\star$ is a Nash equilibrium policy profile for the n-player game (1) if, for all $i = 1, \dots, n$ and all state profiles $\mathbf{s}$,

$$V_i(\mathbf{s}, \boldsymbol{\pi}^\star) \ge V_i(\mathbf{s}, (\pi_1^\star, \dots, \pi_i, \dots, \pi_n^\star)) \quad (2)$$

holds for any $\pi_i : S^n \to \Delta^{|A|}$.

2.2. Generalized MFGs.

A stochastic nonzero-sum n-player game is notoriously hard to analyze. When n is large, the complexity of the problem grows exponentially with respect to n. The mean-field game (MFG), pioneered by (Huang et al., 2006) and (Lasry & Lions, 2007), considers the case when $n \to \infty$ and provides an ingenious and tractable aggregation approach to the otherwise challenging n-player stochastic games. By the functional strong law of large numbers, it was shown that the NE of an MFG is an $\epsilon$-NE of the n-player game. (See (Cardaliaguet et al., 2015) for regular controls and (Guo & Lee, 2017) for singular controls.)

We now formulate the problem of simultaneous learning and decision-making in a generalized MFG framework. As in the classical MFG framework, where there are an infinite number of agents, the homogeneity of the agents allows one to focus on a single (representative) agent. However, in contrast to the classical MFG framework, where the dynamics of each agent are coupled through the population states with complete information, the dynamics of each agent in this generalized MFG framework are coupled through both the population states and actions, with possibly incomplete information.

More specifically, consider an arbitrary agent in the population, whose state $s_t \in S$ evolves under the following controlled stochastic dynamics of mean-field type: at each step t,

$$s_{t+1} \sim P(\cdot \mid s_t, a_t, L_t), \qquad s_0 \sim \mu_0, \quad (3)$$

where $a_t \in A$ is the action of the agent, $\mu_0$ is the initial distribution, $L_t = (L_t^1, \dots, L_t^{|S||A|}) = \mathbb{P}_{s_t, a_t} \in \Delta^{|S||A|}$ is the joint distribution of the state and the action (i.e., the population state-action pair), and $P(\cdot \mid s, a, L)$ is the (possibly unknown) transition probability over S given the state s, the action a, and the population state-action pair L. Finally, $r_t$, the reward received at step t, follows a (possibly unknown) reward distribution $R(s_t, a_t, L_t)$.

For notational simplicity, denote the marginal distributions of s and a under the joint distribution L as $\mu \in \Delta^{|S|}$ and $\alpha \in \Delta^{|A|}$, respectively. Here $\mu_t$ and $\alpha_t$ are the counterparts of $\mathbf{s}_t^{-i} = (s_t^1, \dots, s_t^{i-1}, s_t^{i+1}, \dots, s_t^n)$ and $\mathbf{a}_t^{-i} = (a_t^1, \dots, a_t^{i-1}, a_t^{i+1}, \dots, a_t^n)$ in an n-player game. Corresponding to the n-player game, the admissible (mean-field type) policy is of the form $\pi : S \times \Delta^{|S|} \to \Delta^{|A|}$, which maps each state $s \in S$ and distribution $\mu \in \Delta^{|S|}$ to a randomized action $a \in A$ distributed according to $\pi(s, \mu)$.


Accordingly, given a discount factor $\gamma \in (0, 1)$, the value function for the GMFG with an initial state s, an initial distribution $\mu_0$, an agent policy $\pi$, and a population state-action pair sequence $\mathbf{L} := \{L_t\}_{t=0}^{\infty}$ is defined as

$$V(s, \pi, \mathbf{L}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, L_t) \,\Big|\, s_0 = s,\ s_0 \sim \mu_0 \right], \quad \text{(GMFG)}$$

where $r(s, a, L) = \mathbb{E}[r \mid r \sim R(s, a, L)]$, $a_t \sim \pi(s_t, \mu_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t, L_t)$.

Definition 2.2 (NE for GMFGs). In (GMFG), an agent-population profile $(\pi^\star, L^\star)$ is called a stationary NE if

1. (Single agent side) For any policy $\pi$ and any initial state $s \in S$,

$$V(s, \pi^\star, \{L^\star\}_{t=0}^{\infty}) \ge V(s, \pi, \{L^\star\}_{t=0}^{\infty}). \quad (4)$$

2. (Population side) $\mathbb{P}_{s_t, a_t} = L^\star$ for all $t \ge 0$, where $\{s_t, a_t\}_{t=0}^{\infty}$ is the dynamics under the policy $\pi^\star$ starting from $s_0 \sim \mu^\star$, with $a_t \sim \pi^\star(s_t, \mu^\star)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t, L^\star)$, and $\mu^\star$ being the population state marginal of $L^\star$.

These two conditions correspond to the standard three-step fixed-point solution approach for classical MFGs. The first step is the time-backward step of solving a Bellman equation; the second step is the time-forward step of propagating the underlying dynamics to ensure the consistency of the solution from the first step; and the last step is to find the fixed point of these backward-and-forward iterations. Indeed, the first condition on the single agent side is exactly the counterpart of Eqn. (2). It is worth noting that the second condition is sometimes missing in the existing learning literature for MFGs. This is a crucial omission, as this step ensures the consistency and the eventual correctness of the solution.

It is also worth commenting that classical MFGs only involve the state distribution $\mu$ without considering the action distribution $\alpha$; players thus take pure strategies. In contrast, in the GMFG framework, players may take mixed strategies according to $\pi$.

Example. Take the example of the repeated auction with a budget constraint in Section 1. Denote $s_t \in [C_{\max}]$ as the budget of a representative advertiser at time t, where $[n] := \{0, 1, \dots, n\}$, $C_{\max} \in \mathbb{N}_+$ is the maximum budget allowed on the Ad exchange with a unit bidding price, and $a_t$ is the bid price submitted by this advertiser.

At each round of the auction, assume that a random number of bidders, say K, are randomly selected from the population to compete in the auction. Here K reflects the intensity of the bidding game. (See (Iyer et al., 2011) for a discussion of K, and Figure 6 in this paper for related numerical analysis.) Denote $\alpha_t$ as the bidding distribution of the population. The maximum bid $D_t$ from the $K-1$ opponents then has cumulative mass function $F_{D_t}(d) := \left(\sum_{a=0}^{d} \alpha_t(a)\right)^{K-1}$. Finally, the reward for the representative advertiser, say player 1, with bid $a_t^1$ and budget $s_t$ at a round t in which she is selected can be written as

$$r_t = \mathbb{I}_{\{w_t = 1\}}\left[(v_t - a_t^{\text{second}}) - (1 + \delta)\,\mathbb{I}_{\{s_t < a_t^{\text{second}}\}}\left(a_t^{\text{second}} - s_t\right)\right]. \quad (5)$$

Here $v_t$, the conversion of clicks at time t, follows an unknown distribution, $a_t^j \stackrel{\text{i.i.d.}}{\sim} \alpha_t$ for $2 \le j \le K$, $a_t^{\text{second}} = \max\big(\{a_t^1, \dots, a_t^K\} \setminus \{a_t^{\max}\}\big)$ is the second-highest bid, and $w_t$ is the winner. The event $w_t = 1$, meaning that the first advertiser wins the bid, occurs with probability $1 / \#\arg\max_{[K]}\{a_t^1, a_t^2, \dots, a_t^K\}$.

The dynamics of the budget are

$$s_{t+1} = \begin{cases} s_t, & w_t \ne 1, \\ s_t - a_t^{\text{second}}, & w_t = 1 \text{ and } a_t^{\text{second}} \le s_t, \\ 0, & w_t = 1 \text{ and } a_t^{\text{second}} > s_t. \end{cases}$$

The updated action distribution is $\alpha_{t+1} = \langle \mu_t, \pi_{t+1}(\cdot, \mu_t) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. In this example, both the reward distribution of $r_t$ and the dynamics of $s_t$ are unknown.
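For concreteness, one round of this auction could be simulated roughly as in the sketch below. This is only an illustration of the reward (5) and the budget dynamics above under our own assumptions: the function name `sample_round`, the placeholder conversion sampler `v_sampler`, and the uniform tie-breaking implementation are ours, not part of the paper.

```python
import numpy as np

def sample_round(s, a1, alpha, K, v_sampler, delta, rng):
    """One auction round for the representative advertiser (player 1), as a sketch.
    s: current budget, a1: player 1's bid (an index in range(len(alpha))),
    alpha: population bid distribution, K: number of selected bidders (K >= 2),
    v_sampler(rng): draws the (unknown) conversion of clicks v_t,
    delta: overbidding penalty. Returns (reward, next budget)."""
    opponents = rng.choice(len(alpha), size=K - 1, p=alpha)  # a_t^j ~ alpha_t, j = 2..K
    bids = np.concatenate(([a1], opponents))
    top = bids.max()
    winner = rng.choice(np.flatnonzero(bids == top))         # uniform tie-breaking
    second = np.sort(bids)[-2]                               # second-highest bid
    v = v_sampler(rng)                                       # unknown conversion of clicks
    if winner == 0:                                          # player 1 wins the slot
        reward = (v - second) - (1 + delta) * max(second - s, 0) * (s < second)
        s_next = s - second if second <= s else 0
    else:
        reward, s_next = 0.0, s
    return reward, s_next
```

Repeated application of such a round simulator, together with the population update $\alpha_{t+1} = \langle \mu_t, \pi_{t+1}(\cdot, \mu_t) \rangle$, is the kind of environment the learning algorithms of Section 3 interact with.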

2.3. (Unique) solution for GMFGs

Parallel to the classical MFG framework, we now establish the existence and uniqueness of the NE solution to (GMFG). This result is a generalization of Theorem 6 in (Huang et al., 2006).

To start, take any fixed $L \in \Delta^{|S||A|}$ and define the mapping

$$\Gamma_1 : \Delta^{|S||A|} \to \Pi := \{\pi \mid \pi : S \times \Delta^{|S|} \to \Delta^{|A|}\}$$

such that $\pi = \Gamma_1(L)$ with $\pi(s) = \text{argmax-e}(Q_L^\star(s, \cdot))$. Here argmax-e is a generalized argmax operator, defined to map the set of maximizing components to a randomized action, so that actions with equal maximal Q-values are selected with equal probabilities. Note that this is different from the usual set-valued argmax operator. Mathematically, $\text{argmax-e} : \mathbb{R}^n \to \mathbb{R}^n$ is the mapping with $\text{argmax-e}(x)_i = 1/N_x$ for $i \in \arg\max_i x_i$ and $\text{argmax-e}(x)_i = 0$ for $i \notin \arg\max_i x_i$, where $N_x = \#\arg\max_i x_i$. By definition, $\Gamma_1(L)$ is the optimal policy of an MDP with dynamics $P_L$ and rewards $R_L$ (to be introduced in Section 3), and therefore satisfies the single-agent-side condition in Definition 2.2.
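A minimal sketch of the argmax-e operator follows; the function name and the numerical tolerance are implementation choices of ours, not part of the paper.

```python
import numpy as np

def argmax_e(x, tol=1e-12):
    """Generalized argmax: put equal probability on all (near-)maximal entries
    of x and zero probability elsewhere."""
    x = np.asarray(x, dtype=float)
    maximizers = np.isclose(x, x.max(), atol=tol)
    return maximizers / maximizers.sum()

# e.g. argmax_e([1.0, 3.0, 3.0, 2.0]) -> array([0. , 0.5, 0.5, 0. ])
```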

Assumption 1. There exists a constant $d_1 \ge 0$ such that for any $L_1, L_2 \in \Delta^{|S||A|}$,

$$\|\Gamma_1(L_1) - \Gamma_1(L_2)\|_2 \le d_1 \|L_1 - L_2\|_2. \quad (6)$$


This assumption is closely related to the feedback regularity (FR) condition in the classical MFG literature (Huang et al., 2006).

Next, for any admissible policy $\pi \in \Pi$ and joint population state-action pair $L \in \Delta^{|S||A|}$, define a mapping $\Gamma_2 : \Pi \times \Delta^{|S||A|} \to \Delta^{|S||A|}$ as follows:

$$\Gamma_2(\pi, L) := \mathbb{P}_{s_1, a_1}, \quad (7)$$

where $a_1 \sim \pi(s_1, \mu_1)$, $s_1 \sim \mu P(\cdot \mid \cdot, a_0, L)$, $a_0 \sim \pi(s_0, \mu)$, $s_0 \sim \mu$, $\mu$ is the population state marginal of L, and $\mu_1$ is the marginal distribution of $s_1$.
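When the transition kernel is available in tabular form, $\Gamma_2$ can be evaluated exactly by one forward step of the dynamics. The sketch below is one way to do this under our own assumptions about array shapes and interfaces (the kernel is assumed to be already evaluated at the given L).

```python
import numpy as np

def gamma_2(pi, L, P):
    """One-step forward population update Gamma_2(pi, L), tabular sketch.
    pi: callable (state index, state marginal mu) -> distribution over actions,
    L:  joint state-action distribution, shape (|S|, |A|),
    P:  kernel with P[s, a, s'] = P(s' | s, a, L) for this fixed L.
    Returns the next joint distribution of (s_1, a_1), shape (|S|, |A|)."""
    n_s, n_a = L.shape
    mu0 = L.sum(axis=1)                                              # state marginal of L
    joint0 = np.array([mu0[s] * pi(s, mu0) for s in range(n_s)])     # law of (s_0, a_0)
    mu1 = np.einsum('sa,sat->t', joint0, P)                          # law of s_1
    return np.array([mu1[s] * pi(s, mu1) for s in range(n_s)])       # law of (s_1, a_1)
```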

Assumption 2. There exist constants $d_2, d_3 \ge 0$ such that for any admissible policies $\pi, \pi_1, \pi_2 \in \Pi$ and joint distributions $L, L_1, L_2 \in \Delta^{|S||A|}$,

$$\|\Gamma_2(\pi_1, L) - \Gamma_2(\pi_2, L)\|_2 \le d_2 \|\pi_1 - \pi_2\|_2, \quad (8)$$

$$\|\Gamma_2(\pi, L_1) - \Gamma_2(\pi, L_2)\|_2 \le d_3 \|L_1 - L_2\|_2. \quad (9)$$

Theorem 1 (Existence and uniqueness of the stationary MFG solution). Suppose Assumptions 1 and 2 hold and $d_1 d_2 + d_3 < 1$. Then there exists a unique stationary MFG solution to (GMFG) under dynamics (3).

Proof. First, by Definition 2.2 and the definitions of $\Gamma_i$ (i = 1, 2), $(\pi, L)$ is a stationary Nash equilibrium solution iff $L = \Gamma(L) := \Gamma_2(\Gamma_1(L), L)$ and $\pi = \Gamma_1(L)$. This indicates that for any $L_1, L_2 \in \Delta^{|S||A|}$,

$$\begin{aligned} \|\Gamma(L_1) - \Gamma(L_2)\|_2 &= \|\Gamma_2(\Gamma_1(L_1), L_1) - \Gamma_2(\Gamma_1(L_2), L_2)\|_2 \\ &\le \|\Gamma_2(\Gamma_1(L_1), L_1) - \Gamma_2(\Gamma_1(L_2), L_1)\|_2 + \|\Gamma_2(\Gamma_1(L_2), L_1) - \Gamma_2(\Gamma_1(L_2), L_2)\|_2 \\ &\le (d_1 d_2 + d_3)\|L_1 - L_2\|_2. \end{aligned} \quad (10)$$

Since $d_1 d_2 + d_3 \in [0, 1)$, the Banach fixed-point theorem implies that there exists a unique fixed point of $\Gamma$, or equivalently, a unique stationary MFG solution to (GMFG).
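As a side remark (a standard consequence of the contraction estimate (10), not spelled out above), the full-information fixed-point iteration $L_{k+1} = \Gamma(L_k)$ converges geometrically from any initialization $L_0 \in \Delta^{|S||A|}$:

$$\|L_k - L^\star\|_2 \le (d_1 d_2 + d_3)^k \|L_0 - L^\star\|_2.$$

This idealized iteration is what the learning algorithms in Section 3 approximate when P and R are unknown.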

3. RL Algorithms for GMFGs

Given the unknown reward and transition distributions in the generalized MFG setting, one needs to simultaneously learn P and R while solving the MFG in an online manner, where data are collected on the run for both learning and control purposes. This is related to the classical reinforcement learning problem, with the key differences being that MDPs are replaced with MFGs and the global optimum is replaced by NEs.

The learning aspect is best reflected in the single-agent-side condition in Definition 2.2. Indeed, given $L^\star$, one can retrieve $\pi^\star$ by solving an MDP with transition dynamics $P_{L^\star}(s' \mid s, a) := P(s' \mid s, a, L^\star)$ and reward distributions $R_{L^\star}(s, a) := R(s, a, L^\star)$. Extending this to a general population state-action pair L, the Q-function of the MDP $\mathcal{M}_L$ with dynamics $P_L$ and rewards $R_L$ is written as

$$Q_L^\star(s, a) := r_L(s, a) + \gamma \sum_{s' \in S} P_L(s' \mid s, a)\, V_L^\star(s'), \quad (11)$$

where $r_L(s, a) = \mathbb{E}[r \mid r \sim R_L(s, a)]$, $V_L^\star$ is the optimal value function of the MDP, and $V_L^\star(s) = \max_a Q_L^\star(s, a)$.

Note that when the reward $R_L$ and the transition dynamics $P_L(s' \mid s, a)$ are known, a value iteration method can be applied until convergence to estimate the Q-function in (11). However, when $R_L$ and $P_L(s' \mid s, a)$ are unknown, one can only update the Q-function based on historical observations. For example, taking the observed reward $r(s, a, L)$ as an approximation of $\mathbb{E}[r \mid r \sim R_L(s, a)]$ and $\max_{a'} Q_L(s', a')$ as an approximation of $V_L^\star(s')$, one can update $Q_L$ as

$$Q_L(s, a) \leftarrow Q_L(s, a) + \beta \left[ r(s, a, L) + \gamma \max_{a'} Q_L(s', a') - Q_L(s, a) \right], \quad (12)$$

where $s'$ is the next state after taking action a at state s, $r(s, a, L)$ is the observed reward, $\beta$ is the learning rate, and $\gamma$ is the discount factor. This is referred to as Q-learning in the literature.
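A minimal sketch of the tabular update (12); the array layout and function name are ours.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, beta, gamma):
    """One tabular Q-learning update (12) for the MDP M_L with a fixed L.
    Q: array of shape (|S|, |A|); (s, a, r, s_next): one observed transition;
    beta: learning rate; gamma: discount factor."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += beta * (td_target - Q[s, a])
    return Q
```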

3.1. Q-learning for GMFGs

In this section, we propose a Q-learning algorithm for (GMFG). We begin by noticing the following lower bounds on the action gaps.

For any $\epsilon > 0$, there exist a positive function $\phi(\epsilon)$ and an $\epsilon$-net $S_\epsilon := \{L^{(1)}, \dots, L^{(N_\epsilon)}\}$ of $\Delta^{|S||A|}$ with the following properties: $\min_{i = 1, \dots, N_\epsilon} \|L - L^{(i)}\|_2 \le \epsilon$ for any $L \in \Delta^{|S||A|}$, and $\max_{a' \in A} Q_{L^{(i)}}^\star(s, a') - Q_{L^{(i)}}^\star(s, a) \ge \phi(\epsilon)$ for any $i = 1, \dots, N_\epsilon$, any $s \in S$, and any $a \notin \arg\max_{a \in A} Q_{L^{(i)}}^\star(s, a)$.

Here the existence of $\epsilon$-nets is trivial due to the compactness of the probability simplex $\Delta^{|S||A|}$, and the existence of $\phi(\epsilon)$ comes from the finiteness of the action set A.

These lower bounds characterize the extent to which actions are distinguishable in terms of their Q-values. They are crucial for approximation algorithms (Bellemare et al., 2016), and are closely related to the problem-dependent bounds in regret analysis for reinforcement learning and multi-armed bandits, as well as to advantage learning algorithms such as A2C (Mnih et al., 2016).

In the following, we assume that for any $\epsilon > 0$, $S_\epsilon$ and $\phi(\epsilon)$ are known to the algorithm. In practice, one can choose the special form $\phi(\epsilon) = D\epsilon^\alpha$ with $D > 0$. The exponent $\alpha > 0$ characterizes how fast the action gaps decay with the fineness of the $\epsilon$-net.


In addition, to enable Q-learning, we assume access to a population simulator (see (Perolat et al., 2016; Wai et al.)). That is, for any policy $\pi \in \Pi$, given the current state $s \in S$ and any joint law L (with population state marginal $\mu$), we can obtain the next agent state $s' \sim P(\cdot \mid s, \pi(s, \mu), L)$, a reward $r \sim R(s, \pi(s, \mu), L)$, and the next population state-action pair $L' = \mathbb{P}_{s', \pi(s', \mu)}$. For brevity, we denote the simulator as $(s', r, L') = \mathcal{G}(s, \pi, L)$.
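Schematically, such a simulator can be wrapped as the interface sketched below. The three callables `P_of`, `R_sampler`, and `next_L` are our own assumptions standing in for whatever the environment actually provides (e.g., the tabular auction model above), not interfaces specified by the paper.

```python
import numpy as np

def make_simulator(P_of, R_sampler, next_L, rng):
    """Sketch of the population simulator G(s, pi, L).
    P_of(L)[s, a, s'] returns the tabular kernel P(s' | s, a, L);
    R_sampler(s, a, L, rng) draws a reward r ~ R(s, a, L);
    next_L(pi, L) returns the next population state-action pair (cf. Gamma_2)."""
    def G(s, pi, L):
        mu = L.sum(axis=1)                          # population state marginal of L
        n_s, n_a = L.shape
        a = rng.choice(n_a, p=pi(s, mu))            # a ~ pi(s, mu)
        s_next = rng.choice(n_s, p=P_of(L)[s, a])   # s' ~ P(. | s, a, L)
        r = R_sampler(s, a, L, rng)                 # r ~ R(s, a, L)
        return s_next, r, next_L(pi, L)             # (s', r, L')
    return G
```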

3.2. Alternating Q-learning

Observing the similarity between the fixed-point equation $L = \Gamma_2(\Gamma_1(L), L)$ and the Bellman equation in classical reinforcement learning, and noticing that evaluating $\Gamma_1(L)$ is equivalent to solving an MDP with transition dynamics $P_L(s' \mid s, a) := P(s' \mid s, a, L)$ and reward distributions $R_L(s, a) := R(s, a, L)$, it is natural to consider the following (naive) iterative algorithm (Algorithm 1), which follows the three-step fixed-point solution approach of MFGs: $L_k \to Q_k^\star \to \pi_k \to L_{k+1} \to Q_{k+1}^\star \to \cdots$.

Algorithm 1 Alternating Q-learning for GMFGs (Naive)
1: Input: Initial population state-action pair $L_0$
2: for k = 0, 1, ... do
3:   Perform Q-learning to find the Q-function $Q_k^\star(s, a) = Q_{L_k}^\star(s, a)$ of the MDP with dynamics $P_{L_k}(s' \mid s, a)$ and reward distributions $R_{L_k}(s, a)$.
4:   Solve $\pi_k \in \Pi$ with $\pi_k(s) = \text{argmax-e}(Q_k^\star(s, \cdot))$.
5:   Sample $s \sim \mu_k$, where $\mu_k$ is the population state marginal of $L_k$, and obtain $L_{k+1}$ from $\mathcal{G}(s, \pi_k, L_k)$.
6: end for

However, the update of $Q_k^\star$ requires fully solving an MDP with infinitely many Q-learning steps. This issue can be resolved by using approximate Q-learning updates. The second modification is to use Boltzmann policies in place of the argmax-e operator, whose discontinuity and sensitivity could be the culprit for the error and fluctuation shown in Figure 1. Finally, in each step $L_k$ is projected onto the $\epsilon$-net to exploit the action gaps, which will be shown to be essential for controlling the sub-optimality of the Boltzmann policies. The resulting algorithm is summarized in Algorithm 2.

Here $\text{softmax}_c : \mathbb{R}^n \to \mathbb{R}^n$ is defined as

$$\text{softmax}_c(x)_i = \frac{\exp(c x_i)}{\sum_{j=1}^{n} \exp(c x_j)}, \quad (13)$$

and $\text{Proj}_{S_\epsilon}(L) = \arg\min_{L^{(i)} \in \{L^{(1)}, \dots, L^{(N_\epsilon)}\}} \|L^{(i)} - L\|_2$. The choices of the hyper-parameters c and $T_k$ will be discussed in detail in Lemma 5 and Theorem 2.
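A sketch of these two operators follows; the max-subtraction inside the softmax is a standard numerical-stability device and the epsilon-net is assumed to be stored as an array of flattened net points, both implementation choices of ours.

```python
import numpy as np

def softmax_c(x, c):
    """Boltzmann/softmax operator (13) with temperature parameter c."""
    z = c * (np.asarray(x, dtype=float) - np.max(x))   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def proj_eps_net(L, S_eps):
    """Projection onto the epsilon-net: the closest net point in Euclidean norm.
    S_eps: array with one flattened joint distribution per row."""
    dists = np.linalg.norm(S_eps - L.ravel(), axis=1)
    return S_eps[np.argmin(dists)]
```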

In the special case when the reward $R_L$ and the transition dynamics $P(\cdot \mid s, a, L)$ are known, Algorithm 2 reduces to a value-iteration-based algorithm (Algorithm 3), as follows.

Algorithm 2 Alternating Approximate Q-learning for GMFGs (MF-AQ)
1: Input: Initial $L_0$, tolerance $\epsilon > 0$.
2: for k = 0, 1, ... do
3:   Perform Q-learning for $T_k$ iterations to find the approximate Q-function $\hat{Q}_k^\star(s, a) \approx Q_{L_k}^\star(s, a)$ of the MDP with dynamics $P_{L_k}(s' \mid s, a)$ and reward distributions $R_{L_k}(s, a)$.
4:   Compute $\pi_k \in \Pi$ with $\pi_k(s) = \text{softmax}_c(\hat{Q}_k^\star(s, \cdot))$.
5:   Sample $s \sim \mu_k$, where $\mu_k$ is the population state marginal of $L_k$, and obtain $\tilde{L}_{k+1}$ from $\mathcal{G}(s, \pi_k, L_k)$.
6:   Set $L_{k+1} = \text{Proj}_{S_\epsilon}(\tilde{L}_{k+1})$.
7: end for
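One way the outer loop of Algorithm 2 could be organized in code is sketched below. It relies on the hypothetical helpers sketched earlier (`softmax_c`, `proj_eps_net`, the simulator `G`) and on a `q_learning` routine for the inner loop; all of these are our assumptions rather than the paper's implementation.

```python
import numpy as np

def mf_aq(L0, S_eps, G, q_learning, c, T_ks, rng):
    """Outer loop of MF-AQ (Algorithm 2), as a sketch.
    L0: initial joint distribution of shape (|S|, |A|); S_eps: epsilon-net (rows);
    G(s, pi, L): population simulator; q_learning(L, T): approximate Q-function of
    the MDP M_L after T inner iterations; c: softmax temperature; T_ks: inner
    iteration counts per outer step."""
    L, pi = L0, None
    for T_k in T_ks:
        Q = q_learning(L, T_k)                             # step 3: approximate Q*_{L_k}
        pi = lambda s, mu, Q=Q: softmax_c(Q[s], c)         # step 4: Boltzmann policy
        mu = L.sum(axis=1)                                 # population state marginal
        s = rng.choice(L.shape[0], p=mu)                   # step 5: sample s ~ mu_k
        _, _, L_next = G(s, pi, L)                         #         simulate one step
        L = proj_eps_net(L_next, S_eps).reshape(L.shape)   # step 6: project onto S_eps
    return L, pi
```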

Algorithm 3 Alternating Approximate Value Iteration for GMFGs (MF-VI)
1: Input: Initial $L_0$, tolerance $\epsilon > 0$.
2: for k = 0, 1, ... do
3:   Perform value iteration for $T_k$ iterations to find the approximate Q-function and value function:
4:   for t = 1, 2, ..., $T_k$ do
5:     for all $s \in S$ and $a \in A$ do
6:       $Q_{L_k}(s, a) \leftarrow \mathbb{E}[r \mid s, a, L_k] + \gamma \sum_{s'} P(s' \mid s, a, L_k) V_{L_k}(s')$
7:       $V_{L_k}(s) \leftarrow \max_a Q_{L_k}(s, a)$
8:     end for
9:   end for
10:  Compute a policy $\pi_k \in \Pi$: $\pi_k(s) = \text{softmax}_c(Q_{L_k}(s, \cdot))$.
11:  Sample $s \sim \mu_k$, where $\mu_k$ is the population state marginal of $L_k$, and obtain $\tilde{L}_{k+1}$ from $\mathcal{G}(s, \pi_k, L_k)$.
12:  Set $L_{k+1} = \text{Proj}_{S_\epsilon}(\tilde{L}_{k+1})$.
13: end for

3.3. Convergence

We now show that for any given tolerance $\epsilon > 0$, the MF-AQ algorithm (Algorithm 2) converges to an $\epsilon$-Nash solution of (GMFG).

Theorem 2 (Convergence of MF-AQ). Suppose the assumptions of Theorem 1 and the conditions of Lemma 5 below hold. Choose $T_k = T_{\mathcal{M}_{L_k}}(\delta_k, \epsilon_k)$ with $\sum_{k=0}^{\infty} \delta_k = \delta < \infty$ and $\sum_{k=0}^{\infty} \epsilon_k < \infty$, and set $c = \frac{\log(1/\epsilon)}{\phi(\epsilon)}$. Then for any specified tolerance $\epsilon > 0$, $\limsup_{k \to \infty} \|L_k - L^\star\|_2 = O(\epsilon)$ with probability at least $1 - 2\delta$.

The proof of Theorem 2 depends on the following Lemmas 3, 4, and 5 regarding the softmax mapping and the Q-learning convergence of the MDP subproblem with a fixed $L_k$.

Lemma 3 ((Gao & Pavel, 2017)). The function $\text{softmax}_c$ is c-Lipschitz, i.e., $\|\text{softmax}_c(x) - \text{softmax}_c(y)\|_2 \le c\|x - y\|_2$ for any $x, y \in \mathbb{R}^n$.

Lemma 4. The distance between the softmax and the argmax-e mappings is bounded by

$$\|\text{softmax}_c(x) - \text{argmax-e}(x)\|_2 \le 2n \exp(-c\delta),$$

where $\delta = \max_{i = 1, \dots, n} x_i - \max_{x_j < \max_i x_i} x_j$, with $\delta := \infty$ when all $x_j$ are equal.

Proof. Without loss of generality, assume that $x_1 = x_2 = \cdots = x_m = \max_{i = 1, \dots, n} x_i =: x^\star > x_j$ for all $m < j \le n$. Then

$$\text{argmax-e}(x)_i = \begin{cases} \frac{1}{m}, & i \le m, \\ 0, & \text{otherwise}, \end{cases} \qquad \text{softmax}_c(x)_i = \begin{cases} \dfrac{e^{c x^\star}}{m e^{c x^\star} + \sum_{j=m+1}^{n} e^{c x_j}}, & i \le m, \\[2ex] \dfrac{e^{c x_i}}{m e^{c x^\star} + \sum_{j=m+1}^{n} e^{c x_j}}, & \text{otherwise}. \end{cases}$$

Therefore, writing $\delta_i := x^\star - x_i$ (which is positive for $i > m$ and at least $\delta$),

$$\begin{aligned} \|\text{softmax}_c(x) - \text{argmax-e}(x)\|_2 &\le \|\text{softmax}_c(x) - \text{argmax-e}(x)\|_1 \\ &= m\left(\frac{1}{m} - \frac{e^{c x^\star}}{m e^{c x^\star} + \sum_{j=m+1}^{n} e^{c x_j}}\right) + \frac{\sum_{i=m+1}^{n} e^{c x_i}}{m e^{c x^\star} + \sum_{j=m+1}^{n} e^{c x_j}} \\ &= \frac{2 \sum_{i=m+1}^{n} e^{c x_i}}{m e^{c x^\star} + \sum_{i=m+1}^{n} e^{c x_i}} = \frac{2 \sum_{i=m+1}^{n} e^{-c \delta_i}}{m + \sum_{i=m+1}^{n} e^{-c \delta_i}} \\ &\le \frac{2}{m} \sum_{i=m+1}^{n} e^{-c \delta_i} \le \frac{2(n - m)}{m} e^{-c \delta} \le 2n e^{-c \delta}. \end{aligned}$$
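For intuition, the bound of Lemma 4 can be checked numerically on random inputs. The quick sanity-check script below is ours, not part of the paper; the rounding is only there to create occasional ties.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n = 8.0, 6
for _ in range(1000):
    x = np.round(rng.normal(size=n), 1)                # rounding creates occasional ties
    top = x.max()
    gap = top - x[x < top].max() if (x < top).any() else np.inf   # the action gap delta
    softmax = np.exp(c * (x - top)); softmax /= softmax.sum()
    argmax_e = (x == top) / (x == top).sum()
    lhs = np.linalg.norm(softmax - argmax_e)
    assert lhs <= 2 * n * np.exp(-c * gap) + 1e-12     # Lemma 4 bound holds
```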

Lemma 5 ((Even-Dar & Mansour, 2003)). For an MDP, say $\mathcal{M}$, suppose that the Q-learning algorithm takes step-sizes

$$\alpha_t(s, a) = \begin{cases} \left(\#(s, a, t) + 1\right)^{-\omega}, & (s, a) = (s_t, a_t), \\ 0, & \text{otherwise}, \end{cases}$$

with $\omega \in (1/2, 1)$, where $\#(s, a, t)$ is the number of times the state-action pair (s, a) has been visited up to time t. Also suppose that the covering time of the state-action pairs is bounded by $\mathcal{L}$ with probability at least $p \in (0, 1)$. Then $\|Q_{T_{\mathcal{M}}(\delta, \epsilon)} - Q^\star\|_2 \le \epsilon$ with probability at least $1 - 2\delta$, where $Q_T$ is the T-th update of Q-learning and $Q^\star$ is the (optimal) Q-function, provided that

$$T_{\mathcal{M}}(\delta, \epsilon) = \Omega\left( \left( \frac{\mathcal{L} \log_p(\delta)}{\beta} \log \frac{V_{\max}}{\epsilon} \right)^{\frac{1}{1-\omega}} + \left( \frac{\left(\mathcal{L} \log_p(\delta)\right)^{1 + 3\omega} V_{\max}^2 \log\!\left( \frac{|S||A| V_{\max}}{\delta \beta \epsilon} \right)}{\beta^2 \epsilon^2} \right)^{\frac{1}{\omega}} \right),$$

where $\beta = (1 - \gamma)/2$, $V_{\max} = R_{\max}/(1 - \gamma)$, and $R_{\max}$ is an upper bound on the extreme difference between the expected rewards, i.e., $\max_{s, a, L} r(s, a, L) - \min_{s, a, L} r(s, a, L) \le R_{\max}$.

Here the covering time of a state-action pair sequence is defined as the number of steps needed to visit all state-action pairs starting from any arbitrary state-action pair.
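The polynomial step-size of Lemma 5 can be implemented with a simple visit counter; the sketch below is our own, with the counter layout and the convention that the current visit is included in #(s, a, t) both being assumptions.

```python
import numpy as np

def make_lr(n_states, n_actions, omega=0.87):
    """Polynomial learning rate of Lemma 5: alpha_t(s, a) = (#(s, a, t) + 1)^(-omega)
    for the visited pair; all other pairs are simply not updated at time t."""
    counts = np.zeros((n_states, n_actions), dtype=int)
    def lr(s, a):
        counts[s, a] += 1                       # #(s, a, t): visits to (s, a) up to time t
        return (counts[s, a] + 1) ** (-omega)
    return lr
```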

Proof of Theorem 2. Define $\Gamma_1^k(L_k) := \text{softmax}_c(\hat{Q}_{L_k}^\star)$, where $\hat{Q}_{L_k}^\star$ is the approximate Q-function computed in the k-th outer iteration of Algorithm 2. Let $L^\star$ be the population state-action pair in a stationary NE solution of (GMFG). Then $\pi_k = \Gamma_1^k(L_k)$, and by denoting $d := d_1 d_2 + d_3$ and $\tilde{L}_{k+1} := \Gamma_2(\pi_k, L_k)$,

$$\begin{aligned} \|\tilde{L}_{k+1} - L^\star\|_2 &= \|\Gamma_2(\pi_k, L_k) - \Gamma_2(\Gamma_1(L^\star), L^\star)\|_2 \\ &\le \|\Gamma_2(\Gamma_1(L_k), L_k) - \Gamma_2(\Gamma_1(L^\star), L^\star)\|_2 + \|\Gamma_2(\Gamma_1(L_k), L_k) - \Gamma_2(\Gamma_1^k(L_k), L_k)\|_2 \\ &\le \|\Gamma(L_k) - \Gamma(L^\star)\|_2 + d_2 \|\Gamma_1(L_k) - \Gamma_1^k(L_k)\|_2 \\ &\le (d_1 d_2 + d_3)\|L_k - L^\star\|_2 + d_2 \|\text{argmax-e}(Q_{L_k}^\star) - \text{softmax}_c(\hat{Q}_{L_k}^\star)\|_2 \\ &\le d\|L_k - L^\star\|_2 + d_2 \|\text{softmax}_c(Q_{L_k}^\star) - \text{softmax}_c(\hat{Q}_{L_k}^\star)\|_2 + d_2 \|\text{argmax-e}(Q_{L_k}^\star) - \text{softmax}_c(Q_{L_k}^\star)\|_2 \\ &\le d\|L_k - L^\star\|_2 + c d_2 \|Q_{L_k}^\star - \hat{Q}_{L_k}^\star\|_2 + d_2 \|\text{argmax-e}(Q_{L_k}^\star) - \text{softmax}_c(Q_{L_k}^\star)\|_2. \end{aligned}$$

Since $L_k \in S_\epsilon$ by the projection step, Lemma 4 and Lemma 5 (with the choice $T_k = T_{\mathcal{M}_{L_k}}(\delta_k, \epsilon_k)$) imply that, with probability at least $1 - \delta_k$,

$$\|\tilde{L}_{k+1} - L^\star\|_2 \le d\|L_k - L^\star\|_2 + c d_2 \epsilon_k + 2 d_2 |A| e^{-c \phi(\epsilon)}.$$

Finally, it is clear that with probability at least $1 - \delta_k$,

$$\|L_{k+1} - L^\star\|_2 \le \|\tilde{L}_{k+1} - L^\star\|_2 + \|\tilde{L}_{k+1} - \text{Proj}_{S_\epsilon}(\tilde{L}_{k+1})\|_2 \le d\|L_k - L^\star\|_2 + c d_2 \epsilon_k + 2 d_2 |A| e^{-c \phi(\epsilon)} + \epsilon.$$

By telescoping, this implies that with probability at least $1 - \sum_{k=0}^{K} \delta_k$,

$$\|L_K - L^\star\|_2 \le d^K \|L_0 - L^\star\|_2 + c d_2 \sum_{k=0}^{K-1} d^{K-k} \epsilon_k + \frac{\left(2 d_2 |A| e^{-c \phi(\epsilon)} + \epsilon\right)\left(1 - d^{K+1}\right)}{1 - d}.$$

Since $\{\epsilon_k\}$ is summable and $\sup_{k \ge 0} \epsilon_k < \infty$, we have

$$\sum_{k=0}^{K-1} d^{K-k} \epsilon_k \le \frac{\sup_{k \ge 0} \epsilon_k}{1 - d}\, d^{\lfloor (K-1)/2 \rfloor} + \sum_{k = \lceil (K-1)/2 \rceil}^{\infty} \epsilon_k \to 0 \quad \text{as } K \to \infty.$$

Taking the limit $K \to \infty$, and using $d \in [0, 1)$ and $c = \frac{\log(1/\epsilon)}{\phi(\epsilon)}$ (so that $e^{-c\phi(\epsilon)} = \epsilon$), we conclude that with probability at least $1 - \delta$,

$$\limsup_{k \to \infty} \|L_k - L^\star\|_2 \le \frac{2 d_2 |A| + 1}{1 - d}\,\epsilon = O(\epsilon).$$


4. Experiment: repeated auction game with budget constraint

In this section, we test the proposed MF-AQ algorithm against the Naive algorithm (Algorithm 1), MF-VI, and a couple of popular RL algorithms.

Recall that in the repeated auction game, the goal is for each advertiser to learn to bid in an online sequential repeated auction, when there are budget constraints and a large population of bidders. Here we assume that the distribution of the conversion of clicks v and the reward function R are unknown, as in Section 2.2.

To emphasize the impact of $\alpha_t$, here we fix the budget distribution to be independent of time and known a priori; that is, $\mu_t = \mu_C$ is given. This can be realized by considering new players joining the competition according to a Poisson process.

Here are the parameters used in the experiments: the temperature parameter c = 4.0, the discount factor $\gamma = 0.2$, $\omega = 0.87$ in Lemma 5, and the overbidding penalty $\delta = 0.2$. We take $|S| = 10$, $|A| = 10$, and the budget distribution $\mu_C = \text{uniform}[9]$.

4.1. Comparing MF-AQ with Naive algorithm

Naive algorithm. Assume K = 5, v is uniform[4], and $\alpha_0$ is uniform[9]. Moreover, we set 10000 inner iteration steps between two consecutive updates of the action distribution.

Figure 1 shows that the Naive algorithm does not converge within 1000 outer iterations. In particular, $\alpha_t$ keeps fluctuating.

Figure 1. Performance of the Naive algorithm. (a) $\|\alpha_{t+1} - \alpha_t\|_\infty$; (b) $\mu_{D_t}$ and $\alpha_t$.

MF-AQ. Assume the same conversion of clicks v as uniform[4].

In comparison to the Naive algorithm, the MF-AQ algorithm is computationally more efficient: it converges after about 10 outer iterations; as the number of inner iterations increases, the error decreases (Figure 2); and finally, MF-AQ is responsive to the conversion of clicks even though it is unknown to the advertisers, as shown in Figure 3.

Figure 2. Different numbers of inner iterations within each outer iteration.

Figure 3. Equilibrium under different distributions of v. (a) Action distribution; (b) max-bid distribution.

4.2. Comparing MF-AQ with MF-VI

In addition to Theorem 2, which provides the convergence guarantee of the MF-AQ algorithm when the reward and transition dynamics are unknown, this experiment shows that MF-AQ performs well against MF-VI, which assumes known rewards and transition dynamics.

Set-up. K = 5, α0 = uniform[9], and v is uniform[4].

Result. The heatmap in Figure 4(a) is the Q-table of the MF-AQ algorithm after 20 outer iterations, with $T_k^{\text{MF-AQ}} = 10000$ inner iterations within each outer iteration. The heatmap in Figure 4(b) is the Q-table of the MF-VI algorithm after 20 outer iterations, with $T_k^{\text{MF-VI}} = 5000$ inner iterations within each outer iteration. The MF-VI algorithm converges faster than the MF-AQ algorithm, as the latter needs to make decisions while simultaneously learning, whereas the former is run with given dynamics and rewards. The relative $L_2$ distance between the Q-tables of these two algorithms is $\Delta Q := \frac{\|Q^{\text{MF-VI}} - Q^{\text{MF-AQ}}\|_2}{\|Q^{\text{MF-VI}}\|_2} = 0.098879$. This implies that the MF-AQ algorithm learns well: it learns the true MFG solution with 90-percent accuracy using 10000 inner iterations.

Figure 4. Q-tables: (a) $Q^{\text{MF-AQ}}$ vs. (b) $Q^{\text{MF-VI}}$.

Table 1. Q-table comparison given $T_k^{\text{MF-VI}} = 5000$.

$T_k^{\text{MF-AQ}}$:   1000      3000     5000      10000
$\Delta Q$:             0.21263   0.1294   0.10258   0.0989

4.3. Comparing MF-AQ with existing learning algorithms

We next compare the MF-AQ algorithm with two popular multi-agent reinforcement learning algorithms: (1) the IL algorithm and (2) the MF-Q algorithm. The IL algorithm (Tan, 1993) considers n independent learners, where each player solves a decentralized reinforcement learning problem without considering the other players in the system. The MF-Q algorithm (Yang et al., 2018) is an extension of the Nash-Q learning algorithm for general n-player games introduced in (Hu & Wellman, 2003). It considers the aggregate action $\bar{a}^{-i} = \frac{\sum_{j \ne i} a^j}{n - 1}$ from the opponents of each player. This model reduces the computational complexity of the game, but works only for a restricted class of models where the interactions are through the mean of the actions.

Performance criterion. For a given policy profile $\boldsymbol{\pi}$, the following metric (Perolat et al., 2018) is adopted to measure how close it is to an NE policy:

$$C(\boldsymbol{\pi}) = \frac{1}{|S|} \sum_{i=1}^{n} \sum_{\boldsymbol{s} \in S} \left( \max_{\pi_i} V_i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi_i)) - V_i(\boldsymbol{s}, \boldsymbol{\pi}) \right).$$

If $\boldsymbol{\pi}^\star$ is an NE, by definition $C(\boldsymbol{\pi}^\star) = 0$, and it is easy to check that $C(\boldsymbol{\pi}) \ge 0$ in general. The policy $\arg\max_{\pi_i} V_i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi_i))$ is called the best response to $\boldsymbol{\pi}^{-i}$.
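When the value functions can be evaluated (e.g., on the small tabular benchmarks used here), this criterion can be computed directly. The sketch below is ours; `value` and `best_response_value` are assumed oracles, not routines from the paper.

```python
def exploitability(policies, states, value, best_response_value):
    """Average gap between each player's best-response value and her value under the
    joint policy (Perolat et al., 2018); zero exactly at a Nash equilibrium.
    value(i, s, policies): V_i(s, pi);
    best_response_value(i, s, policies): max over pi_i of V_i(s, (pi^{-i}, pi_i))."""
    total = 0.0
    for i in range(len(policies)):
        for s in states:
            total += best_response_value(i, s, policies) - value(i, s, policies)
    return total / len(states)
```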

Note that there are other performance criteria in the literature, such as the total discounted reward. However, in a game setting, these may not be appropriate, as high rewards for one player may not be beneficial for the others.

Here we assume v is uniform[4]. The IL algorithm converges the fastest, around 10000 iterations, but with the largest error, 0.1913. This is because the IL algorithm does not incorporate information from the other players. The MF-Q algorithm converges around 250000 iterations with a smaller error of 0.1131. This error is still bigger than the error from MF-AQ, because MF-Q only incorporates first-order information (the mean action) from the population rather than the empirical distribution of the whole population, which is not enough for the auction game. The performance of MF-AQ improves as the number of total iterations increases, and it converges to the lowest error, 0.0691, around 20000 total iterations.

Figure 5. Performance comparison

Figure 6. Equilibrium under different numbers of competitors. (a) Action distribution with v ∼ uniform[3]; (b) max-bid distribution with v ∼ uniform[3]; (c) action distribution with v ∼ exp(4); (d) max-bid distribution with v ∼ exp(4).

4.4. Additional insight of MF-AQ.

Figure 6 compares the action and max-bid distributions under different values of K, with two distributions for v: uniform[3] and exp(4). When v follows uniform[3], it is not beneficial to bid above price 3, by the definition of the reward function (5); accordingly, in Figure 6(a) the center of the distribution is no bigger than 3. When K increases, players start to act more aggressively and bid over 3 more frequently. Similar observations hold when v ∼ exp(4), as shown in Figure 6(c).

5. Conclusion

This paper builds a generalized mean-field game framework for simultaneous learning and decision-making, establishes the existence and uniqueness of an appropriate Nash equilibrium solution, and proposes a Q-learning algorithm (MF-AQ) with proven convergence. Numerical experiments demonstrate superior performance compared with existing RL algorithms.


References

Acciaio, B., Backhoff, J., and Carmona, R. Extended mean field control problems: stochastic maximum principle and transport perspective. Arxiv Preprint:1802.05754, 2018.

Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. Increasing the action gap: new operators for reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1476–1483, 2016.

Cai, H., Ren, K., Zhang, W., Malialis, K., Wang, J., Yu, Y., and Guo, D. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 661–670. ACM, 2017.

Cardaliaguet, P., Delarue, F., Lasry, J.-M., and Lions, P.-L. The master equation and the convergence problem in mean field games. Arxiv Preprint:1509.02505, 2015.

Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.

Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. Arxiv Preprint:1704.00805, 2017.

Guo, X. and Lee, J. S. Mean field games with singular controls of bounded velocity. Arxiv Preprint:1703.04437, 2017.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. Arxiv Preprint:1702.08165, 2017.

Hernandez-Leal, P., Kartal, B., and Taylor, M. E. Is multiagent deep reinforcement learning the answer or the question? A brief survey. Arxiv Preprint:1810.05587, 2018.

Hu, J. and Wellman, M. P. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.

Huang, M., Malhame, R. P., and Caines, P. E. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.

Iyer, K., Johari, R., and Sundararajan, M. Mean field equilibria of dynamic auctions with learning. ACM SIGecom Exchanges, 10(3):10–14, 2011.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.

Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. Real-time bidding with multi-agent reinforcement learning in display advertising. Arxiv Preprint:1802.09756, 2018.

Kapoor, S. Multi-agent reinforcement learning: A report on challenges and approaches. Arxiv Preprint:1807.09427, 2018.

Kizilkale, A. C. and Caines, P. E. Mean field stochastic adaptive control. IEEE Transactions on Automatic Control, 58(4):905–920, 2013.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, 2000.

Lasry, J.-M. and Lions, P.-L. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., and Petersen, S. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.

Perolat, J., Strub, F., Piot, B., and Pietquin, O. Learning Nash equilibrium for general-sum Markov games from batch data. Arxiv Preprint:1606.08718, 2016.

Perolat, J., Piot, B., and Pietquin, O. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Tan, M. Multi-agent reinforcement learning: independent vs. cooperative agents. In International Conference on Machine Learning, pp. 330–337, 1993.

Wai, H. T., Yang, Z., Wang, Z., and Hong, M. Multi-agent reinforcement learning via double averaging primal-dual optimization. Arxiv Preprint:1806.00877, 2018.


Yang, J., Ye, X., Trivedi, R., Xu, H., and Zha, H. Deep mean field games for learning optimal behavior policy of large populations. Arxiv Preprint:1711.03156, 2017.

Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. Mean field multi-agent reinforcement learning. Arxiv Preprint:1802.05438, 2018.

Yin, H., Mehta, P. G., Meyn, S. P., and Shanbhag, U. V. Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629–644, 2014.