Reinforcement Learning on Markov Games
Nilanjan Dasgupta
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708
Machine Learning Seminar Series
Overview
Markov Decision Processes (MDP) and Markov games.
Optimal policy search : Value iteration (VI), Policy iteration (PI), Reinforcement learning (RL).
Minimax-Q learning for zero-sum (ZS) games.
Quantitative analysis of the Minimax-Q and Q-learning algorithms for ZS games.
Cons of Minimax-Q and development of Nash-Q learning algorithm.
Constraints of Nash-Q and development of Friend-or-Foe Q-learning.
Brief discussion of Partially Observable Stochastic Games (POSG) : POMDPs for multi-agent stochastic games.
MDP and Markov Games
MDP
• A single agent operating in a fixed environment ("world").
• Represented by a tuple {S, A, T, R}, with
     T : S × A → p(S),   R : S × A → ℝ
• The agent's objective is to find the optimal strategy, a mapping π : S → p(A), that maximizes the expected reward
     E{ Σ_{j=0}^{N} γ^j r_{t+j} }
  where N is the length of the horizon and γ is the discount factor.
Markov Games
• Multiple agents operating in a shared environment.
• Represented by a tuple {S, A1, …, An, T, R1, …, Rn}, with
     T : S × A1 × … × An → p(S),   Ri : S × A1 × … × An → ℝ
• Agent i's objective is to maximize its expected reward
     E{ Σ_{j=0}^{N} γ^j r_{i,t+j} },   i = 1, …, n
Markov Games
MG = {S, A1, …, An, T, R1, …, Rn}
• When |S| = 1 (single state), a Markov game reduces to a matrix game.
• When n = 1 (single agent), a Markov game reduces to an MDP.
• For a two-player zero-sum (ZS) game, there is a single reward function, with the agents having diametrically opposite goals.
• For example, the two-player zero-sum matrix game shown below (rows: agent, columns: opponent; entries are agent, opponent payoffs):

                  Opponent
              rock    paper   wood
      rock     0,0    1,-1    -1,1
 Agent paper  -1,1     0,0    1,-1
      wood    1,-1    -1,1     0,0
Optimal policy : Matrix Games

              rock    paper   wood
      rock      0       1      -1
      paper    -1       0       1
      wood      1      -1       0

• R_{o,a} is the reward for the agent for taking action a with the opponent taking action o.
• The agent strives to maximize the expected reward while the opponent tries to minimize it.
• For the strategy π to be optimal, it needs to satisfy
     π* = argmax_{π ∈ p(A)} min_{o ∈ O} Σ_{a ∈ A} R_{o,a} π_a
i.e., find the strategy for the agent that has the best "worst-case" scenario.
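The maximin strategy above can be computed as a linear program: maximize a guaranteed value v subject to the expected payoff against every opponent action being at least v. A minimal sketch using `scipy.optimize.linprog` (the function name `minimax_strategy` is illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

# Payoff matrix from the slide: R[o, a] is the agent's reward when the
# opponent plays o and the agent plays a (rock, paper, wood).
R = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])

def minimax_strategy(R):
    """Solve pi* = argmax_pi min_o sum_a R[o,a] pi_a as a linear program.

    Variables x = [pi_1, ..., pi_n, v]; maximize the guaranteed value v."""
    n_o, n_a = R.shape
    c = np.append(np.zeros(n_a), -1.0)                 # linprog minimizes, so use -v
    A_ub = np.hstack([-R, np.ones((n_o, 1))])          # v - sum_a R[o,a] pi_a <= 0, all o
    b_ub = np.zeros(n_o)
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1) # probabilities sum to 1
    bounds = [(0, 1)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:n_a], res.x[-1]                      # optimal mixed strategy, game value

pi, v = minimax_strategy(R)
print(pi, v)   # uniform (1/3, 1/3, 1/3) with value 0 for this symmetric game
```

Because the payoff matrix is skew-symmetric, the value of the game is 0 and the unique maximin strategy is to randomize uniformly.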
Optimal policy : MDP & Markov Games
• A host of methods exists: value iteration (VI) and policy iteration (PI), both of which assume complete knowledge of T, and reinforcement learning (RL).
• Value iteration (VI) uses dynamic programming to estimate the value functions, and convergence is guaranteed [Bertsekas, 1987].

MDP:
     Q(s,a)  = R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s')
     V(s)    = max_{a'∈A} Q(s,a')
     a*      = argmax_{a'∈A} Q(s,a')

Markov Games:
     Q(s,a,o) = R(s,a,o) + γ Σ_{s'∈S} T(s,a,o,s') V(s')
     V(s)     = max_{π∈p(A)} min_{o∈O} Σ_{a∈A} Q(s,a,o) π_a
     π*       = argmax_{π∈p(A)} min_{o∈O} Σ_{a∈A} Q(s,a,o) π_a
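As a concrete illustration of the MDP equations above, here is a minimal value-iteration sketch; the 2-state transition tensor `T[s, a, s']` and rewards `R[s, a]` are made-up illustrative numbers, not from the slides:

```python
import numpy as np

# Value iteration on a made-up 2-state, 2-action MDP.
# T[s, a, s'] is the transition probability, R[s, a] the immediate reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
              [[0.0, 1.0], [0.5, 0.5]]])  # transitions from state 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Q(s,a) = R(s,a) + gamma * sum_{s'} T(s,a,s') V(s')
    Q = R + gamma * np.einsum("sap,p->sa", T, V)
    V_new = Q.max(axis=1)                  # V(s) = max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop at the Bellman fixed point
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)                  # a*(s) = argmax_a Q(s,a)
print(V, policy)
```

Because the Bellman operator is a γ-contraction, the loop converges to the unique fixed point regardless of the initial V.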
Note
• Every MDP has at least one stationary, deterministic optimal policy.
• There may not exist an optimal stationary deterministic policy for a Markov game.
• The reason is the agent's uncertainty in guessing the opponent's move exactly, especially when the agents move simultaneously, unlike in turn-taking games such as tic-tac-toe.
Learning Optimal policy : Reinforcement Learning
• Q-learning was first developed by Watkins in 1989 for optimal policy learning in MDPs without explicit knowledge of T.
• The agent receives a reward r while making the transition from s to s' by taking action a:
     Q(s,a) ← (1-α) Q(s,a) + α ( r + γ V(s') )
• T(s,a,s') is implicitly involved in the above state transition.
• Minimax-Q utilizes the same principle of Q-learning in the two-player ZS game:
     Q(s,a,o) ← (1-α) Q(s,a,o) + α ( r + γ V(s') )
     π(s,·)   = argmax_{π'∈p(A)} min_{o'} Σ_{a'} π'(a') Q(s,a',o')     (via LP)
     V(s)     = min_{o'} Σ_{a'} π(s,a') Q(s,a',o')
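The tabular Q-learning update above can be sketched in a few lines; the toy `ChainEnv` and its `reset`/`step` interface are illustrative assumptions, not part of the slides:

```python
import random
from collections import defaultdict

random.seed(0)

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma V(s'))."""
    Q = defaultdict(float)                        # unseen (s, a) pairs default to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:             # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            v_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)
            s = s2
    return Q

class ChainEnv:
    """Toy 3-state chain: action 1 moves right (reward 1 on reaching the end),
    action 0 quits immediately with no reward."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        if a == 1:
            self.s += 1
            done = (self.s == 2)
            return self.s, (1.0 if done else 0.0), done
        return self.s, 0.0, True

Q = q_learning(ChainEnv(), actions=[0, 1])
print(Q[(0, 1)], Q[(1, 1)])   # moving right earns the delayed reward
```

Note that T(s, a, s') never appears explicitly: the environment's sampled transitions stand in for it, which is the point of the slide.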
Performance of Minimax-Q
• Software Soccer game on a 4x5 grid.
• 20 states and 5 actions {N, S, E, W and Stand}.
• Four different policies,
1. Minimax-Q trained against a random opponent
2. Minimax-Q trained against a Minimax-Q opponent
3. Q trained against a random opponent
4. Q trained against a Q opponent
Constraints of Minimax-Q : Nash-Q
• Convergence is guaranteed only for two-player zero-sum games.
• Nash-Q, proposed by Hu & Wellman '98, maintains a set of approximate Q-functions and updates them as
     Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
  where
     Q_i(s', π_1*,…,π_n*) = Σ_{a_1∈A_1} … Σ_{a_n∈A_n} π_1*(a_1) … π_n*(a_n) Q_i(s', a_1,…,a_n)
• Note that the Minimax-Q learning scheme was
     Q(s,a,o) ← (1-α) Q(s,a,o) + α ( r + γ V(s') )
• π_k* – the one-stage Nash equilibrium policy of player k under the current estimates of {Q_1,…,Q_n}.
Single-stage Nash Equilibrium
Let's explain via a classic example, the Battle of the Sexes (rows: Chris, columns: Pat):

              Opera   Fight
     Opera     1,2     0,0
     Fight     0,0     2,1

• Check that there exist two situations in which no player can single-handedly change his/her action to increase his/her payoff.
• NE : (Opera, Opera) and (Fight, Fight).
• For an n-player normal-form game, {π_1*,…,π_n*} represents a Nash equilibrium iff
     r_i(π_1*,…,π_i*,…,π_n*) ≥ r_i(π_1*,…,π_i,…,π_n*)   for all π_i,  i = 1,…,n
• Two types of Nash equilibrium : coordinated and adversarial.
• All normal-form non-cooperative games have a Nash equilibrium; some may be mixed-strategy only.
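The unilateral-deviation check above can be automated for pure strategies by enumerating action profiles; a small sketch (the function name `pure_nash_equilibria` is illustrative):

```python
import numpy as np

def pure_nash_equilibria(R1, R2):
    """Enumerate pure-strategy Nash equilibria of a two-player bimatrix game.

    R1[i, j], R2[i, j]: payoffs to players 1 and 2 for the action pair (i, j)."""
    eqs = []
    for i in range(R1.shape[0]):
        for j in range(R1.shape[1]):
            # (i, j) is an NE iff neither player gains by deviating unilaterally
            if R1[i, j] >= R1[:, j].max() and R2[i, j] >= R2[i, :].max():
                eqs.append((i, j))
    return eqs

# Battle of the Sexes: actions 0 = Opera, 1 = Fight
R_chris = np.array([[1, 0], [0, 2]])
R_pat   = np.array([[2, 0], [0, 1]])
print(pure_nash_equilibria(R_chris, R_pat))   # [(0, 0), (1, 1)]
```

Mixed-strategy equilibria are not found this way; they require solving the indifference conditions (or an LP in the zero-sum case).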
Analysis of Nash-Q Learning
     Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
• Why update in such a way?
• For a 1-player game (MDP), Nash-Q is simple maximization – Q-learning.
• For zero-sum games, Nash-Q is Minimax-Q – guaranteed convergence.
• Cons – the Nash equilibrium is not unique (multiple equilibrium points exist), hence convergence is not guaranteed in general.
• Guaranteed to work when:
  • there exists a unique coordinated equilibrium for the entire game and for each stage game defined by the Q-functions during the entire learning, or
  • there exists a unique adversarial equilibrium for the entire game and for each stage game defined by the Q-functions during the entire learning.
Relaxing constraints of Nash-Q : Friend-or-Foe Q
• Uniqueness of the NE is relaxed in FFQ, but the algorithm needs to know the nature of each opponent : "friend" (coordinated equilibrium) or "foe" (adversarial equilibrium).
• The update keeps the Nash-Q form
     Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
  where, for a two-player game,
     Q_1(s', π_1*, π_2*) = max_{a_1∈A_1, a_2∈A_2} Q_1(s', a_1, a_2)                               if the opponent is a friend
     Q_1(s', π_1*, π_2*) = max_{π∈p(A_1)} min_{a_2∈A_2} Σ_{a_1∈A_1} π(a_1) Q_1(s', a_1, a_2)      if the opponent is a foe
• There exists a convergence guarantee for the Friend-or-Foe Q-learning algorithm.
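To make the friend/foe distinction concrete, the two value operators can be compared on a single state's Q-table; the numbers below are hypothetical, and the foe case reuses the standard maximin LP:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical Q-values Q[a1, a2] for one state of a two-player game
# (illustrative numbers, not from the slides).
Q = np.array([[2.0, -1.0],
              [0.0,  1.0]])

# Friend: players coordinate, so take the jointly best action pair.
friend_value = Q.max()

# Foe: maximin over mixed strategies pi for player 1, as a linear program
# (maximize v subject to sum_a1 pi[a1] Q[a1, a2] >= v for every a2).
n1, n2 = Q.shape
c = np.append(np.zeros(n1), -1.0)              # linprog minimizes, so use -v
A_ub = np.hstack([-Q.T, np.ones((n2, 1))])     # v - pi . Q[:, a2] <= 0, all a2
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n2),
              A_eq=np.append(np.ones(n1), 0.0).reshape(1, -1), b_eq=[1.0],
              bounds=[(0, 1)] * n1 + [(None, None)])
foe_value = res.x[-1]

print(friend_value, foe_value)   # coordination yields more than worst-case play
```

For this table the friend value is 2.0 while the maximin (foe) value is 0.5, illustrating why the learner must know which operator to apply.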
Analysis of FFQ Learning
• FFQ provides RL-based strategy learning in multi-player general-sum games.
• Like Nash-Q, it should not be expected to find a Nash equilibrium unless either a coordinated or an adversarial equilibrium exists.
• Unlike Nash-Q, FFQ does not require learning Q-estimates of all the players in the game, only its own.
• FFQ's restrictions are much weaker : it doesn't require the NE to be unique all along.
• Both FFQ and Nash-Q fail for games having no NE (infinite games).

     Nash-Q : Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
     FFQ    : Q_i(s', π_1*,…,π_n*) = max_{x_1,…,x_k ∈ X_1×…×X_k} min_{y_1,…,y_l ∈ Y_1×…×Y_l} Q_i(s', x_1,…,x_k, y_1,…,y_l)
  where X_1,…,X_k are the action sets of player i and its friends and Y_1,…,Y_l those of its foes.
Partial Observability of States : POSG
• The entire discussion so far assumed the states to be known, although the transition probabilities and reward functions can be learned asynchronously (RL).
• Partially Observable Stochastic Games (POSG) assume the underlying states to be only partially observed via observations.
• Stochastic games are analogous to MDPs, and so is learning them via RL.
• A POMDP can be interpreted as an MDP over belief space, with increased complexity due to the continuous belief space. But a POSG cannot be solved by transforming it into a stochastic game over belief spaces, since each agent's belief is potentially different.
• E. A. Hansen et al. propose a policy-iteration approach for POSGs that alleviates the scaling issue in the finite-horizon case via iterative elimination of dominated strategies (policies).
Summary
• The theories of MDPs and Markov games are strongly related.
• Minimax-Q learning is a Q-learning scheme proposed for two-player ZS games.
• Minimax-Q is very conservative in its action, since it chooses a strategy that maximizes the worst-case performance of the agent.
• Nash-Q is developed for multi-player, general-sum games but converges only under strict restrictions (existence and uniqueness of NE).
• FFQ relaxes the restrictions (uniqueness) a bit, but not much.
• Most algorithms are reactive, i.e., each agent lets the others choose an equilibrium point and then learns its best response.
• Under partial observability of states, solution methods do not yet scale.