Transcript of "Learning Games", presented by Aggelos Papazissis and Alexandros Papapostolou.

Page 1:

Learning Games

Presented by:

Aggelos Papazissis

Alexandros Papapostolou

Page 2:

Subject

Correlated Equilibria
Learning, Mutation & Long-Run Equilibria
Rational Learning Leading to NE
Dynamic Fictitious Play & Dynamic Gradient Play

Page 3:

Computing Correlated Equilibria in Multi-Player Games

Nash Equilibrium (N.E.) is the standard notion of rationality in game theory. Correlated Equilibrium (C.E.) is another, competing notion of rationality, more general than N.E.

Christos Papadimitriou, Tim Roughgarden

Advantages of C.E. vs. N.E.:
1. It is guaranteed to always exist.
2. It arises from simple and natural dynamics, in a sense in which the N.E. does not.
3. It can be found in polynomial time, for any number of players and strategies, by linear programming.
4. It ensures Pareto optimality within the set of correlated equilibria.

Representation | Size O(…)   | Pure NE       | Mixed NE      | CE | Optimal CE
Normal form    | n·s^n       | Linear        | PPAD-complete | P  | P
Graphical      | n·s^(d+1)   | NP-complete   | PPAD-complete | P  | NP-hard
Symmetric      |             | NP-complete   | PPAD-complete | P  | P
Anonymous      |             | NP-hard       |               | P  | P
Polymatrix     | n²·s²       |               | PPAD-complete | P  | NP-hard
Circuit        |             | NP-complete   |               |    |
Congestion     |             | PLS-complete  |               | P  | NP-hard

Page 4:

The idea of Correlated Equilibria

While a mixed N.E. is a distribution on the strategy space that is "uncorrelated" (a product of independent distributions), a C.E. is a general distribution over strategy profiles.

Each player chooses his action according to his observation of the value of the same public signal.

A strategy assigns an action to every possible observation a player can make.

If no player would want to deviate from the recommended strategy (assuming the others obey their recommendations rather than deviate), because it is the best in expectation, the distribution is called a correlated equilibrium.
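In standard notation (not spelled out on the slide), a distribution x over strategy profiles is a correlated equilibrium if obeying any recommendation j is at least as good, in expectation, as switching to any other strategy k:

\[
\sum_{s_{-i}} x(j, s_{-i})\,\bigl[u_i(j, s_{-i}) - u_i(k, s_{-i})\bigr] \;\ge\; 0
\qquad \text{for every player } i \text{ and all strategies } j, k.
\]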

Page 5:

An example of C.E.: Dare-Chicken Out

        D       C
D     0, 0    7, 2
C     2, 7    6, 6

Five distributions over the strategy profiles (rows and columns ordered D, C):

1. Probability 1 on (D,C): pure N.E.
2. Probability 1 on (C,D): pure N.E.
3. 1/2 on (D,C) and 1/2 on (C,D): the "traffic light" C.E., expected payoffs (4.5, 4.5).
4. 1/3 each on (C,C), (D,C) and (C,D): a C.E. with expected payoffs (5, 5).
5. 1/4 on each profile, i.e. both players independently playing the mixed strategy {1/2, 1/2}: expected payoffs (3.75, 3.75).

A third party draws one of three cards, (C,C), (D,C), (C,D), each with probability 1/3, and privately recommends a strategy to each player. Suppose a player is told to play C. Then the other plays C with probability 1/2 and D with probability 1/2, so Daring yields 0(1/2) + 7(1/2) = 3.5 while Chickening out yields 2(1/2) + 6(1/2) = 4; the player therefore obeys and chooses C. (A player told to play D knows the other was told C, so deviating is not profitable either.) Since nobody wants to change strategy, this is a C.E. Its expected payoff is 7(1/3) + 2(1/3) + 6(1/3) = 5, better than the mixed N.E.

Author: Iskander Karibzhanov
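The claim that a C.E. can be found by linear programming can be checked directly on this game. Below is a minimal sketch (not part of the slides) that writes the four incentive constraints for scipy.optimize.linprog, verifies the 1/3-1/3-1/3 distribution above, and then maximizes one possible objective, the total expected payoff; any linear objective could be substituted.

```python
# Sketch (illustrative): the correlated-equilibrium LP for Dare-Chicken Out.
import numpy as np
from scipy.optimize import linprog

# Strategy profiles ordered (D,D), (D,C), (C,D), (C,C).
u1 = np.array([0, 7, 2, 6])   # row player's payoffs
u2 = np.array([0, 2, 7, 6])   # column player's payoffs

# Incentive constraints, written as A_ub @ x <= 0 (linprog uses <=, so the
# usual ">= 0" CE inequalities are negated).
A_ub = np.array([
    [ 2, -1,  0,  0],   # P1 told D, deviation to C
    [ 0,  0, -2,  1],   # P1 told C, deviation to D
    [ 2,  0, -1,  0],   # P2 told D, deviation to C
    [ 0, -2,  0,  1],   # P2 told C, deviation to D
])
b_ub = np.zeros(4)
A_eq, b_eq = np.ones((1, 4)), [1.0]          # probabilities sum to 1

# Check the slide's distribution: 1/3 each on (D,C), (C,D), (C,C).
x_slide = np.array([0, 1/3, 1/3, 1/3])
print("slide distribution is a CE:", bool(np.all(A_ub @ x_slide <= 1e-9)))

# One notion of an optimal CE: maximize total expected payoff (linprog minimizes).
res = linprog(c=-(u1 + u2), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 4)
print("optimal CE:", np.round(res.x, 3), "total expected payoff:", -res.fun)
```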

Page 6:

Computing Correlated Equilibria (pmp's)

Existence proof: every game has a CE. By linear programming duality, the dual program (D) is shown to be infeasible, so the primal (P) is feasible.

Constructive proof: apply the ellipsoid algorithm to the dual, which has polynomially many variables and exponentially many constraints, so as to reduce the number of dual constraints to polynomially many; that number suffices to compute CE's, and furthermore an optimal CE.

At each step of the ellipsoid algorithm we find violated convex combinations of the constraints of (D) by using Markov chain computations.

At the conclusion of the algorithm we have a polynomial number of such combinations (the cuts of the ellipsoid algorithm) that are themselves infeasible; call this system (D').

Solving the dual of this new linear program, (P'), gives the required CE as a polynomial mixture of products (pmp's).

To optimize the CE, we reduce the dimensionality of the linear program: for each player, the strategy profiles of the opponents are divided into equivalence classes within which the player's utility depends only on his own choice.

Main result: every game has a CE that is a convex combination of polynomially many product distributions (polynomial in the number of players n and the number of strategies m).
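In symbols (standard notation for a mixture of products, not from the slide), such a CE has the form

\[
x(s) \;=\; \sum_{k=1}^{K} \alpha_k \prod_{i=1}^{n} p_i^{k}(s_i),
\qquad \alpha_k \ge 0,\ \sum_{k} \alpha_k = 1,
\]

where each \(p_i^{k}\) is a mixed strategy of player i and the number of terms K is polynomial in n and m.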

Page 7:

Computing Optimal Correlated Equilibria

The algorithm provides a refinement of a polynomial CE scheme that is guaranteed to sample the strategy space of the given game G according to an optimal CE.

It can be formulated as a linear program, the dual of which can be solved in polynomial time via the ellipsoid method.

The ellipsoid method will only generate a polynomial number of dual constraints during its execution and these constraints suffice to compute an optimal dual solution

The primal variables (strategy profiles) that correspond to these dual constraints then suffice to solve the primal problem optimally.

Since this “reduced primal” has a polynomial number of variables and constraints, it can be solved in polynomial time, yielding an optimal CE.

Page 8:

Learning, Mutation, and Long Run Equilibria in Games

Application to 2x2 symmetric games: natural selection of long-run equilibria through learning, bounded rationality and mutation.

3 Hypotheses:

1. Inertia hypothesis: not all agents need to react instantaneously to their environment.

2. Myopia hypothesis: players react myopically.

3. Mutation hypothesis: some agents may change their strategies at random.

Page 9:

A Symmetric 2x2 Game

                Player II
                s1      s2
Player I   s1   2,2     0,0
           s2   0,0     1,1

Choice of computer system in a small community:
• Ten students, each using one of the two systems s1, s2.
• They meet each other at random, and when two are on the same system they collaborate.
• s1 is superior to s2.

2 N.E.: E1 = (s1,s1), E2 = (s2,s2)

Path dependence: If at least 4 students are initially using computer s1, then all students eventually end up using s1

Mutation: occasionally a student leaves and the newcomer chooses s1 according to the share of s1-users in the outside world. Thus, with mutations, the system can move between the two equilibria E1 and E2 (Darwinian adjustments).

With transition probabilities p (from E1 to E2) and p' (from E2 to E1), the long-run (stationary) distribution over {E1, E2} is (p'/(p+p'), p/(p+p')). Assuming the mutation rate ε is small, as ε → 0 this tends to (1, 0): E1, the Pareto dominant equilibrium, is the long-run equilibrium. E1 takes about 100,000 periods to be upset, while E2 takes only 78.

Page 10:

Coordination Games

Population states 0 to 6 (the number of agents playing the first strategy), with a critical level z*.

Basin of attraction of stage 6: {3, 4, 5, 6}
Basin of attraction of stage 0: {0, 1, 2}

[Figure: mutation-cost trees over the states 0 to 6]

The least-cost 6-tree: 3 mutations.
The least-cost 0-tree: 4 mutations.

Conclusion: stage 6, which has the larger basin of attraction, achieves the minimum cost among all states. The equilibrium with the largest basin of attraction is always selected in the long run.

Thus, upsetting equilibria by large jumps is a natural consequence of independent mutations; gradual, step-by-step transitions (also totalling 4 mutations in the example) are less likely than immediate jumps.

Page 11:

Rational Learning leads to NE

Nash equilibrium is the central concept of game theory, yet for games played only once the process that leads to it is not fully understood. With repeated interaction there is enough time to observe the opponents' behavior, which allows a statistical learning theory leading to NE. The setting is repeated play among a small number of subjectively rational agents.

Shortcomings of 'myopic' behavior of a player:
1. He ignores the fact that his opponents also use a dynamic learning process.
2. He would not perform a costly experiment.
3. He ignores strategic considerations regarding the future.

Theoretic approach to overcome these flaws:
1. A standard perfect-monitoring infinitely repeated game with discounting.
2. A fixed matrix of payoffs for every action combination of the players.
3. A discount factor used to evaluate future payoffs.
4. Each player maximizes the present value of his total expected payoff.
5. Players need not have any information about the opponents' payoff matrices or their rationality.

Page 12:

Rational Learning leads to NE

Main message of the paper: if players start with a vector of subjectively rational strategies, and if their individual subjective beliefs regarding opponents' actions are compatible with the strategies actually chosen, then in finite time they converge to playing according to an ε-NE (for small ε). That is, they learn to predict the future play of the game and play an ε-NE.

Assumptions of the model:
1. Players' objective is to maximize their long-term expected discounted payoff; learning is not a goal in itself but a consequence of the overall plans.
2. Learning proceeds through Bayesian updating of individual prior beliefs (a minimal sketch follows below).
3. Players do not require full knowledge of each other's strategies, nor known prior distributions on the parameters of the game.
4. Strategies are chosen independently.

Learning in myopic theories may never converge, or may converge to false beliefs; dynamic approaches have the potential to overcome this difficulty. Experimentation is evaluated according to its long-run contribution to expected utility; a player chooses randomly when and how to experiment, and optimal experimentation leads to an individually optimal solution.
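A minimal sketch of the Bayesian-updating idea in assumption 2 (illustrative types and numbers, not from the paper): a player keeps a prior over a few candidate opponent strategies and updates it after every observed action.

```python
# Sketch (illustrative): Bayesian updating of beliefs over a few hypothetical
# opponent strategies.  Each "type" is summarized by its per-period
# probability of cooperating; the belief is updated after every observation.
types = {"grim_cooperator": 0.95, "random": 0.5, "defector": 0.05}
belief = {name: 1 / 3 for name in types}          # uniform prior

observed = ["C", "C", "C", "D"]                   # an example observed path

for action in observed:
    like = {n: (p if action == "C" else 1 - p) for n, p in types.items()}
    norm = sum(belief[n] * like[n] for n in types)
    belief = {n: belief[n] * like[n] / norm for n in types}   # Bayes' rule

print({n: round(b, 3) for n, b in belief.items()})
```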

Page 13:

Examples and elaboration (1)

Infinitely repeated Prisoner's Dilemma game. Payoffs to PI (he), with a > b > c > d; the game is symmetric, A is the aggressive action and C is cooperation:

                 PII (she)
                 A       C
PI (he)   A      c       a
          C      d       b

A discount parameter λ1 (0 < λ1 < 1) is used to evaluate infinite streams of payoffs. Each player knows his own parameters, and there is perfect monitoring. Prior subjective beliefs are probability distributions over the opponent's strategies. Consider the set of pure strategies g^t, t = 0, 1, 2, …, ∞, with g^∞ the grim-trigger strategy; if not triggered earlier, g^t prescribes unprovoked aggression (playing A) from time t on. PI believes that PII is likely to cooperate by playing her grim-trigger strategy, but he also believes there are positive probabilities that she will stop cooperating earlier for various reasons, so he assigns her strategies g^0, g^1, …, g^∞ probabilities β = (β0, β1, …, β∞) that sum to 1 with βt > 0.

PI's best response is then a strategy of the form g^T1 for some T1 = 0, 1, …, ∞. PII holds similar beliefs (a vector α) about PI's strategy and chooses some g^T2 as her best response. The game is then played according to the strategy pair (g^T1, g^T2). Learning to predict the future play must occur; the players' beliefs converge to the truth only as time goes on.
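A minimal numerical sketch of this best-response calculation (illustrative payoff values, discount factor and beliefs; not from the paper): given a belief vector β over the opponent's trigger time and a discount factor, compute the expected discounted payoff of each candidate g^T1 and pick the best cutoff.

```python
# Sketch (illustrative parameters, not from the paper): PI's best response
# among the trigger strategies g^T1, given beliefs beta over PII's trigger
# time t2 and a discount factor lam.  Payoffs satisfy a > b > c > d.
a, b, c, d = 4.0, 3.0, 1.0, 0.0
lam, H = 0.9, 100          # discount factor; horizon H stands in for "infinity"

# Belief over PII's trigger time: small, decaying mass on early defection,
# the residual mass on "cooperate forever" (index H).
beta = [0.01 * 0.9 ** t for t in range(H)]
beta.append(1.0 - sum(beta))

def value(T1, t2):
    """PI's discounted payoff when PI triggers at T1 and PII triggers at t2."""
    m = min(T1, t2)
    v = sum(lam ** k * b for k in range(m))            # mutual cooperation
    if T1 < t2:
        v += lam ** m * a                              # PI defects first
    elif T1 > t2:
        v += lam ** m * d                              # PII defects first
    else:
        v += lam ** m * c                              # simultaneous defection
    return v + sum(lam ** k * c for k in range(m + 1, H))   # mutual aggression

best_T1 = max(range(H + 1),
              key=lambda T1: sum(p * value(T1, t2) for t2, p in enumerate(beta)))
print("best trigger time T1:", best_T1, "(H means: never defect unprovoked)")
```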

Page 14:

Examples and elaboration (2)

Learning and Teaching: a 2-person, infinite, symmetric version of the "Chicken Game". Learning is not a passive state: optimizers who believe their opponents are open to learning may find it worthwhile to act as teachers. Simultaneously at the beginning, and with perfect monitoring after every history, each player chooses to yield (Y) or insist (I).

                 PII
                 Y         I
PI       Y      0, 0      1, 2
         I      2, 1     -1, -1

The strategy S_t prescribes yielding for the first time at time t (insisting before that). There are no dominant strategies and there are 2 symmetric pure-strategy NE: (S0, S∞), in which he yields immediately and she insists forever, and (S∞, S0). Vectors α and β denote PII's and PI's beliefs, respectively. PI may think (putting himself partially in her shoes) that she is equally likely to wait any number of the first n periods before yielding, to see if he will yield first, or that she may insist forever with probability ε because (unlike him) she assigns a very large loss to ever yielding. If the future is important enough to PI, his best response to β is to wait n periods in case she yields first and, if she does not, to yield himself at time n+1. As long as she is willing to find out about him, he will try to convince her by his actions that he is tough. If both players adopt such reasoning, a pair of strategies (S_T1, S_T2) will be chosen. If T1 = 0 and T2 > 0, there was no attempt to teach on the part of PI and the play is a NE. If 0 < T1 < T2, PI failed to teach her and the play is not a NE. If T1 > T2, the play is again not a NE, but he wins.


Page 15:

Main Results

Theorem 1. Let f and f^i be the n-vector of strategies actually chosen and the beliefs of player i, respectively. If f is absolutely continuous with respect to f^i, then for every ε > 0 and for almost every play path z there is a time T such that for all t ≥ T, f_z(t) plays ε-like f^i_z(t).

1. This theorem states that if the vector of strategies actually chosen is absolutely continuous with respect to player i's beliefs, then that player will learn to accurately predict the future play of the game.

2. The real probability of any future history cannot differ from the probability assigned by the beliefs of player i by more than ε.

Theorem 2. Let f and f^1, f^2, …, f^n be strategy vectors, respectively the one actually played and the beliefs of the players. Suppose that for every player i:

1) f_i is a best response to f^i_{-i}, and

2) f is absolutely continuous with respect to f^i.

Then for every ε > 0 and for almost all play paths z there is a time T = T(z, ε) such that for every t ≥ T there exists an ε-equilibrium f^ of the repeated game such that f_z(t) plays ε-like f^.

1. In other words, given any ε > 0, with probability 1 there will be some time T after which the players play ε-like an ε-NE.

2. If utility-maximizing players start with individual subjective beliefs, then in the long run their behavior must be essentially the same as behavior described by an ε-NE.

Page 16:

Dynamic fictitious play (intro)

Best Response strategy. The empirical frequencies are a running average of the opponent's actions, and a player jumps to the best response to the empirical frequencies of the opponent. This is a continuous-time form of a repeated game in which players continually update strategies in response to observations of opponent actions, without knowledge of opponent intentions. A fictitious player does not have the ability to be forward looking, to accurately anticipate her opponents' play, or to understand how her behavioral rule will affect her opponents' responses. Primary objective: how interacting players could converge to a NE. By playing the optimized best response to the observed empirical frequencies, the optimizing player will eventually converge to its own optimal response to a fixed-strategy opponent. If both players presume that the other player is using a constant strategy, their update mechanisms become intertwined: FICTITIOUS PLAY (FP). The repeated game would be in equilibrium if the empirical frequencies converged. Will repeated play converge to a NE?

Results that establish convergence of FP:
• 2-player zero-sum games
• 2-player, 2-move games
• noisy 2-player, 2-move games with a unique NE
• noisy 2-player, 2-move games with countably many NE
• 2-player games where one player has only 2 moves

Empirical frequencies need not converge:
• Shapley game (2 players, 3 moves each)
• Jordan counterexample (3 players, 2 moves each)

Page 17:

Dynamic fictitious play (FP setup)

STATIC GAME: each player Pi selects a strategy pi and receives a real-valued reward according to the utility function ui(pi, p-i):

u1(p1, p2) = p1^T M1 p2 + τ H(p1)
u2(p2, p1) = p2^T M2 p1 + τ H(p2)

where τ H(pi) is the weighted entropy of the strategy. At equilibrium, ui(pi, p*-i) ≤ ui(p*i, p*-i) for every pi, and pi* = βi(p*-i) is the best response.

The strategy of player Pi at time k is the optimal response to the running average of the opponent's actions: pi(k) = βi(q-i(k)).

DYNAMIC FP: derivative action FP can lead, in some cases, to behaviors converging to NE in previously non-convergent situations. Standard continuous-time FP:

q'1(t) = β1(q2(t)) - q1(t)
q'2(t) = β2(q1(t)) - q2(t)

Each player's strategy is a best response to a combination of empirical frequencies and a weighted derivative of empirical frequencies: pi(t) = βi(q-i(t) + γ q'-i(t)).

Exact derivative action FP (exact DAFP):

q'1 = β1(q2 + γ q'2) - q1
q'2 = β2(q1 + γ q'1) - q2

Approximate derivative action FP (approximate DAFP), with λ > 0:

q'1 = β1(q2 + γλ(q2 - r2)) - q1
q'2 = β2(q1 + γλ(q1 - r1)) - q2

Noisy derivative measurements:

q'1 = β1(q2 + q'2 + e2) - q1
q'2 = β2(q1 + q'1 + e1) - q2

In this case the empirical frequencies converge to a neighborhood of the set of Nash equilibria, whose size depends on the accuracy of the derivative measurements.
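A minimal simulation sketch of these dynamics (illustrative payoffs and parameters, not from the paper): Euler integration of continuous-time FP for a 2-player game with the entropy-smoothed (softmax) best response. The filter state r, with r' = λ(q - r), is one way to realize the approximate-derivative term λ(q - r); setting γ = 0 recovers standard FP.

```python
# Sketch (illustrative): continuous-time fictitious play with the
# entropy-smoothed (softmax) best response, integrated with Euler steps.
import numpy as np

def smoothed_best_response(M, q, tau=0.1):
    """Softmax best response to opponent frequencies q; payoff matrix M has
    rows indexed by own action, columns by the opponent's action."""
    v = M @ q
    e = np.exp((v - v.max()) / tau)
    return e / e.sum()

M1 = np.array([[0.0, 7.0], [2.0, 6.0]])   # placeholder payoffs (Dare-Chicken Out)
M2 = M1                                    # symmetric game: same matrix for both

q1, q2 = np.array([0.9, 0.1]), np.array([0.2, 0.8])   # empirical frequencies
r1, r2 = q1.copy(), q2.copy()                          # filtered frequencies
dt, gamma, lam = 0.01, 0.5, 5.0

for _ in range(20_000):
    # Approximate DAFP: feed q + gamma*lam*(q - r) to the best response;
    # gamma = 0 recovers standard fictitious play.
    dq1 = smoothed_best_response(M1, q2 + gamma * lam * (q2 - r2)) - q1
    dq2 = smoothed_best_response(M2, q1 + gamma * lam * (q1 - r1)) - q2
    r1 += dt * lam * (q1 - r1)             # filter producing the derivative proxy
    r2 += dt * lam * (q2 - r2)
    q1 += dt * dq1
    q2 += dt * dq2

print("empirical frequencies:", np.round(q1, 3), np.round(q2, 3))
```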

Page 18:

Shapley “Fashion” Game (1)

2 players, each with 3 moves: {Red, Green, Blue}.
Player 1, the fashion leader, wants to differ from Player 2.
Player 2, the fashion follower, wants to copy Player 1.
Key assumption: players do not announce preferences.

Daily routine:

– Play game

– Observe actions

– Update strategies
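A minimal sketch of this daily routine as discrete-time fictitious play. The 0/1 payoff matrices below are an assumption: the leader is taken to prefer being one step "ahead" of the follower in a cyclic color order (the structure of Shapley's 3x3 example), rather than merely different, since a plain match/mismatch game is constant-sum and FP is known to converge there.

```python
# Sketch (assumed payoffs in the spirit of Shapley's example): the leader
# scores when his color is one step "ahead" of the follower's (cyclically),
# the follower scores when she matches the leader.
import numpy as np

# M[own action][opponent action], moves 0, 1, 2 = Red, Green, Blue.
M1 = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)  # leader
M2 = np.eye(3)                                                  # follower

counts1 = np.array([1.0, 0.0, 0.0])    # action counts (an asymmetric start)
counts2 = np.array([0.0, 1.0, 0.0])

for day in range(1, 30_001):
    q1, q2 = counts1 / counts1.sum(), counts2 / counts2.sum()
    a1 = int(np.argmax(M1 @ q2))        # best response to empirical frequencies
    a2 = int(np.argmax(M2 @ q1))
    counts1[a1] += 1                    # observe actions, update the averages
    counts2[a2] += 1
    if day in (1_000, 10_000, 30_000):  # snapshots of the empirical frequencies
        print(day, np.round(counts1 / counts1.sum(), 3),
              np.round(counts2 / counts2.sum(), 3))
```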

Page 19:

Shapley “Fashion” Game (2)

With derivative action the empirical frequencies approach the (unique) NE, whereas under standard FP the empirical frequencies converge to wrong values. As λ increases, the oscillations are progressively reduced.

Derivative action FP can be locally convergent when standard FP is not convergent.

Page 20:

Dynamic Gradient Play

Better Response strategy: a player adjusts his current strategy in a gradient direction suggested by the empirical frequencies of the opponent.

Utility function: ui(pi, p-i) = pi^T Mi p-i

Strategy of each player: pi(t) = ΠΔ[ qi(t) + Mi q-i(t) ], a combination of the player's own empirical frequency and a projected gradient step using the opponent's empirical frequencies (ΠΔ is the projection onto the probability simplex).

Gradient play (GP) dynamics:

q'1(t) = ΠΔ[ q1(t) + M1 q2(t) ] - q1(t)
q'2(t) = ΠΔ[ q2(t) + M2 q1(t) ] - q2(t)

The equilibrium points of continuous-time GP are precisely the NE; however, gradient-based evolution cannot converge to a completely mixed NE.

Exact derivative action GP (DAGP):

q'1 = ΠΔ[ q1 + M1 (q2 + γ q'2) ] - q1
q'2 = ΠΔ[ q2 + M2 (q1 + γ q'1) ] - q2

In the (ideal) case of exact DAGP there always exists a γ such that a completely mixed NE is locally asymptotically stable. Stability of approximate DAGP may or may not be achieved:

1. Completely mixed NE: never asymptotically stable under standard GP, but derivative action can enable convergence.

2. Strict NE: approximate DAGP always results in locally stable behavior near a (strict) NE.
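A minimal sketch of the GP dynamics (illustrative payoffs, not from the paper): Euler integration for a 2x2 game, with an explicit Euclidean projection onto the probability simplex.

```python
# Sketch (illustrative): continuous-time gradient play, Euler-integrated,
# with an explicit Euclidean projection onto the probability simplex.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

M1 = np.array([[0.0, 7.0], [2.0, 6.0]])   # placeholder payoffs (rows: own action)
M2 = M1                                    # symmetric game: same matrix for both

q1, q2 = np.array([0.9, 0.1]), np.array([0.2, 0.8])
dt = 0.01

for _ in range(50_000):
    dq1 = project_simplex(q1 + M1 @ q2) - q1   # gradient step, then project
    dq2 = project_simplex(q2 + M2 @ q1) - q2
    q1 += dt * dq1
    q2 += dt * dq2

print("empirical frequencies:", np.round(q1, 3), np.round(q2, 3))
```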

Page 21:

Dynamic Gradient Play

With derivative action, the empirical frequencies converge to the completely mixed NE.

Page 22:

Jordan anti-coordination Game

3 players, each with 2 moves: {Left, Right}.
Player 1 wants to differ from Player 2.
Player 2 wants to differ from Player 3.
Player 3 wants to differ from Player 1.
Players do not announce preferences.

Daily routine:

– Play game

– Observe actions

– Update strategies

Standard FP does not converge.
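A minimal sketch of this routine (illustrative 0/1 payoffs encoding the description above): every player runs discrete-time fictitious play against the empirical frequencies of the player she wants to differ from.

```python
# Sketch (illustrative): the 3-player anti-coordination game described above,
# with every player running discrete-time fictitious play against the player
# she wants to differ from.
import numpy as np

counts = np.array([[2.0, 1.0],     # per-player action counts for {Left, Right}
                   [1.0, 2.0],     # (an asymmetric start, so play does not
                   [1.0, 1.0]])    #  freeze at the symmetric mixed point)

for day in range(1, 20_001):
    freqs = counts / counts.sum(axis=1, keepdims=True)
    actions = []
    for i in range(3):
        q = freqs[(i + 1) % 3]                # frequencies of the target player
        payoff = np.array([q[1], q[0]])       # P(differ) for moves Left, Right
        actions.append(int(np.argmax(payoff)))
    for i, a in enumerate(actions):
        counts[i, a] += 1
    if day in (1_000, 20_000):                # snapshots of the frequencies
        print(day, np.round(counts / counts.sum(axis=1, keepdims=True), 3))
```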