Reinforcement Learning on Markov Games
Nilanjan Dasgupta
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708
Machine Learning Seminar Series
Overview
Markov Decision Processes (MDP) and Markov games.
Optimal policy search : Value iteration (VI), Policy iteration (PI), Reinforcement learning (RL).
Minimax-Q learning for zero-sum (ZS) games.
Quantitative analysis of the Minimax-Q and Q-learning algorithms for ZS games.
Cons of Minimax-Q and development of Nash-Q learning algorithm.
Constraints of Nash-Q and development of Friend-or-Foe Q-learning.
Brief discussion of Partially Observable Stochastic Games (POSG) : POMDPs for multi-agent stochastic games.
MDP and Markov Games
MDP
• A single agent operating in a fixed environment ("world").
• Represented by a tuple {S, A, T, R}, with
     T : S × A → p(S),   R : S × A → ℝ
• The agent's objective is to find the optimal strategy, a mapping π : S → p(A), that maximizes the expected reward
     E{ Σ_{j=0}^{N} γ^j r_{t+j} }
  where N is the length of the horizon and γ is the discount factor.
Markov Games
• Multiple agents operating in a shared environment.
• Represented by a tuple {S, A1, …, An, T, R1, …, Rn}, with
     T : S × A1 × … × An → p(S),   Ri : S × A1 × … × An → ℝ
• Agent i's objective is to maximize its expected reward
     E{ Σ_{j=0}^{N} γ^j r_{i,t+j} },   i = 1, …, n
Markov Games
MG = {S, A1, …, An, T, R1, …, Rn}
• When |S| = 1 (single state), a Markov game reduces to a matrix game.
• When n = 1 (single agent), a Markov game reduces to an MDP.
• For a two-player zero-sum (ZS) game, there is a single reward function, with the agents having diametrically opposite goals.
• For example, the two-player zero-sum matrix game shown below (rows: agent, columns: opponent; entries are agent, opponent payoffs):

                  Opponent
              rock    paper   wood
      rock     0,0    1,-1    -1,1
 Agent paper  -1,1     0,0    1,-1
      wood    1,-1    -1,1     0,0
Optimal policy : Matrix Games

              rock    paper   wood
      rock      0       1      -1
      paper    -1       0       1
      wood      1      -1       0

• R_{o,a} is the reward for the agent for taking action a with the opponent taking action o.
• The agent strives to maximize the expected reward while the opponent tries to minimize it.
• For the strategy π to be optimal, it needs to satisfy
     π* = argmax_{π ∈ p(A)} min_{o ∈ O} Σ_{a ∈ A} R_{o,a} π_a
i.e., find the strategy for the agent that has the best "worst-case" scenario.
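The maximin strategy above can be computed as a linear program: maximize a guaranteed value v subject to the expected payoff against every opponent action being at least v. A minimal sketch using `scipy.optimize.linprog` (the function name `minimax_strategy` is illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

# Payoff matrix from the slide: R[o, a] is the agent's reward when the
# opponent plays o and the agent plays a (rock, paper, wood).
R = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])

def minimax_strategy(R):
    """Solve pi* = argmax_pi min_o sum_a R[o,a] pi_a as a linear program.

    Variables x = [pi_1, ..., pi_n, v]; maximize the guaranteed value v."""
    n_o, n_a = R.shape
    c = np.append(np.zeros(n_a), -1.0)                 # linprog minimizes, so use -v
    A_ub = np.hstack([-R, np.ones((n_o, 1))])          # v - sum_a R[o,a] pi_a <= 0, all o
    b_ub = np.zeros(n_o)
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1) # probabilities sum to 1
    bounds = [(0, 1)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:n_a], res.x[-1]                      # optimal mixed strategy, game value

pi, v = minimax_strategy(R)
print(pi, v)   # uniform (1/3, 1/3, 1/3) with value 0 for this symmetric game
```

Because the payoff matrix is skew-symmetric, the value of the game is 0 and the unique maximin strategy is to randomize uniformly.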
Optimal policy : MDP & Markov Games
• A host of methods exists: value iteration (VI) and policy iteration (PI), both of which assume complete knowledge of T, and reinforcement learning (RL).
• Value iteration (VI) uses dynamic programming to estimate the value functions, and convergence is guaranteed [Bertsekas, 1987].

MDP:
     Q(s,a)  = R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s')
     V(s)    = max_{a'∈A} Q(s,a')
     a*      = argmax_{a'∈A} Q(s,a')

Markov Games:
     Q(s,a,o) = R(s,a,o) + γ Σ_{s'∈S} T(s,a,o,s') V(s')
     V(s)     = max_{π∈p(A)} min_{o∈O} Σ_{a∈A} Q(s,a,o) π_a
     π*       = argmax_{π∈p(A)} min_{o∈O} Σ_{a∈A} Q(s,a,o) π_a
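As a concrete illustration of the MDP equations above, here is a minimal value-iteration sketch; the 2-state transition tensor `T[s, a, s']` and rewards `R[s, a]` are made-up illustrative numbers, not from the slides:

```python
import numpy as np

# Value iteration on a made-up 2-state, 2-action MDP.
# T[s, a, s'] is the transition probability, R[s, a] the immediate reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
              [[0.0, 1.0], [0.5, 0.5]]])  # transitions from state 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Q(s,a) = R(s,a) + gamma * sum_{s'} T(s,a,s') V(s')
    Q = R + gamma * np.einsum("sap,p->sa", T, V)
    V_new = Q.max(axis=1)                  # V(s) = max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop at the Bellman fixed point
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)                  # a*(s) = argmax_a Q(s,a)
print(V, policy)
```

Because the Bellman operator is a γ-contraction, the loop converges to the unique fixed point regardless of the initial V.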
Note
• Every MDP has at least one stationary, deterministic optimal policy.
• There may not exist an optimal stationary deterministic policy for a Markov game.
• The reason is the agent's uncertainty in guessing the opponent's move exactly, especially when the agents move simultaneously, unlike in turn-taking games such as tic-tac-toe.
Learning Optimal policy : Reinforcement Learning
• Q-learning was first developed by Watkins in 1989 for optimal policy learning in MDPs without explicit knowledge of T.
• The agent receives a reward r while making the transition from s to s' by taking action a:
     Q(s,a) ← (1-α) Q(s,a) + α ( r + γ V(s') )
• T(s,a,s') is implicitly involved in the above state transition.
• Minimax-Q utilizes the same principle of Q-learning in the two-player ZS game:
     Q(s,a,o) ← (1-α) Q(s,a,o) + α ( r + γ V(s') )
     π(s,·)   = argmax_{π'∈p(A)} min_{o'} Σ_{a'} π'(a') Q(s,a',o')     (via LP)
     V(s)     = min_{o'} Σ_{a'} π(s,a') Q(s,a',o')
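The tabular Q-learning update above can be sketched in a few lines; the toy `ChainEnv` and its `reset`/`step` interface are illustrative assumptions, not part of the slides:

```python
import random
from collections import defaultdict

random.seed(0)

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma V(s'))."""
    Q = defaultdict(float)                        # unseen (s, a) pairs default to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:             # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            v_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)
            s = s2
    return Q

class ChainEnv:
    """Toy 3-state chain: action 1 moves right (reward 1 on reaching the end),
    action 0 quits immediately with no reward."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        if a == 1:
            self.s += 1
            done = (self.s == 2)
            return self.s, (1.0 if done else 0.0), done
        return self.s, 0.0, True

Q = q_learning(ChainEnv(), actions=[0, 1])
print(Q[(0, 1)], Q[(1, 1)])   # moving right earns the delayed reward
```

Note that T(s, a, s') never appears explicitly: the environment's sampled transitions stand in for it, which is the point of the slide.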
Performance of Minimax-Q
• Software Soccer game on a 4x5 grid.
• 20 states and 5 actions {N, S, E, W and Stand}.
• Four different policies,
1. Minimax-Q trained against a random opponent
2. Minimax-Q trained against a Minimax-Q opponent
3. Q trained against a random opponent
4. Q trained against a Q opponent
Constraints of Minimax-Q : Nash-Q
• Convergence is guaranteed only for two-player zero-sum games.
• Nash-Q, proposed by Hu & Wellman '98, maintains a set of approximate Q-functions and updates them as
     Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
  where
     Q_i(s', π_1*,…,π_n*) = Σ_{a_1∈A_1} … Σ_{a_n∈A_n} π_1*(a_1) … π_n*(a_n) Q_i(s', a_1,…,a_n)
• Note that the Minimax-Q learning scheme was
     Q(s,a,o) ← (1-α) Q(s,a,o) + α ( r + γ V(s') )
• π_k* – the one-stage Nash equilibrium policy of player k under the current estimates of {Q_1,…,Q_n}.
Single-stage Nash Equilibrium
Let's explain via a classic example, the Battle of the Sexes (rows: Chris, columns: Pat):

              Opera   Fight
     Opera     1,2     0,0
     Fight     0,0     2,1

• Check that there exist two situations in which no player can single-handedly change his/her action to increase his/her payoff.
• NE : (Opera, Opera) and (Fight, Fight).
• For an n-player normal-form game, {π_1*,…,π_n*} represents a Nash equilibrium iff
     r_i(π_1*,…,π_i*,…,π_n*) ≥ r_i(π_1*,…,π_i,…,π_n*)   for all π_i,  i = 1,…,n
• Two types of Nash equilibrium : coordinated and adversarial.
• All normal-form non-cooperative games have a Nash equilibrium; some may be mixed-strategy only.
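The unilateral-deviation check above can be automated for pure strategies by enumerating action profiles; a small sketch (the function name `pure_nash_equilibria` is illustrative):

```python
import numpy as np

def pure_nash_equilibria(R1, R2):
    """Enumerate pure-strategy Nash equilibria of a two-player bimatrix game.

    R1[i, j], R2[i, j]: payoffs to players 1 and 2 for the action pair (i, j)."""
    eqs = []
    for i in range(R1.shape[0]):
        for j in range(R1.shape[1]):
            # (i, j) is an NE iff neither player gains by deviating unilaterally
            if R1[i, j] >= R1[:, j].max() and R2[i, j] >= R2[i, :].max():
                eqs.append((i, j))
    return eqs

# Battle of the Sexes: actions 0 = Opera, 1 = Fight
R_chris = np.array([[1, 0], [0, 2]])
R_pat   = np.array([[2, 0], [0, 1]])
print(pure_nash_equilibria(R_chris, R_pat))   # [(0, 0), (1, 1)]
```

Mixed-strategy equilibria are not found this way; they require solving the indifference conditions (or an LP in the zero-sum case).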
Analysis of Nash-Q Learning
     Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
• Why update in such a way?
• For a 1-player game (MDP), Nash-Q is simple maximization – Q-learning.
• For zero-sum games, Nash-Q is Minimax-Q – guaranteed convergence.
• Cons – the Nash equilibrium is not unique (multiple equilibrium points exist), hence convergence is not guaranteed in general.
• Guaranteed to work when:
  • there exists a unique coordinated equilibrium for the entire game and for each stage game defined by the Q-functions during the entire learning, or
  • there exists a unique adversarial equilibrium for the entire game and for each stage game defined by the Q-functions during the entire learning.
Relaxing constraints of Nash-Q : Friend-or-Foe Q
• Uniqueness of the NE is relaxed in FFQ, but the algorithm needs to know the nature of each opponent : "friend" (coordinated equilibrium) or "foe" (adversarial equilibrium).
• The update keeps the Nash-Q form
     Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
  where, for a two-player game,
     Q_1(s', π_1*, π_2*) = max_{a_1∈A_1, a_2∈A_2} Q_1(s', a_1, a_2)                               if the opponent is a friend
     Q_1(s', π_1*, π_2*) = max_{π∈p(A_1)} min_{a_2∈A_2} Σ_{a_1∈A_1} π(a_1) Q_1(s', a_1, a_2)      if the opponent is a foe
• There exists a convergence guarantee for the Friend-or-Foe Q-learning algorithm.
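To make the friend/foe distinction concrete, the two value operators can be compared on a single state's Q-table; the numbers below are hypothetical, and the foe case reuses the standard maximin LP:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical Q-values Q[a1, a2] for one state of a two-player game
# (illustrative numbers, not from the slides).
Q = np.array([[2.0, -1.0],
              [0.0,  1.0]])

# Friend: players coordinate, so take the jointly best action pair.
friend_value = Q.max()

# Foe: maximin over mixed strategies pi for player 1, as a linear program
# (maximize v subject to sum_a1 pi[a1] Q[a1, a2] >= v for every a2).
n1, n2 = Q.shape
c = np.append(np.zeros(n1), -1.0)              # linprog minimizes, so use -v
A_ub = np.hstack([-Q.T, np.ones((n2, 1))])     # v - pi . Q[:, a2] <= 0, all a2
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n2),
              A_eq=np.append(np.ones(n1), 0.0).reshape(1, -1), b_eq=[1.0],
              bounds=[(0, 1)] * n1 + [(None, None)])
foe_value = res.x[-1]

print(friend_value, foe_value)   # coordination yields more than worst-case play
```

For this table the friend value is 2.0 while the maximin (foe) value is 0.5, illustrating why the learner must know which operator to apply.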
Analysis of FFQ Learning
• FFQ provides RL-based strategy learning in multi-player general-sum games.
• Like Nash-Q, it should not be expected to find a Nash equilibrium unless either a coordinated or an adversarial equilibrium exists.
• Unlike Nash-Q, FFQ does not require learning Q-estimates of all the players in the game, only its own.
• FFQ's restrictions are much weaker : it doesn't require the NE to be unique all along.
• Both FFQ and Nash-Q fail for games having no NE (infinite games).

     Nash-Q : Q_i(s, a_1,…,a_n) ← (1-α) Q_i(s, a_1,…,a_n) + α ( r_i + γ Q_i(s', π_1*,…,π_n*) )
     FFQ    : Q_i(s', π_1*,…,π_n*) = max_{x_1,…,x_k ∈ X_1×…×X_k} min_{y_1,…,y_l ∈ Y_1×…×Y_l} Q_i(s', x_1,…,x_k, y_1,…,y_l)
  where X_1,…,X_k are the action sets of player i and its friends and Y_1,…,Y_l those of its foes.
Partial Observability of States : POSG
• The entire discussion so far assumed the states to be known, although the transition probabilities and reward functions can be learned asynchronously (RL).
• Partially Observable Stochastic Games (POSG) assume the underlying states to be only partially observed via observations.
• Stochastic games are analogous to MDPs, and so is learning them via RL.
• A POMDP can be interpreted as an MDP over belief space, with increased complexity due to the continuous belief space. But a POSG cannot be solved by transforming it into a stochastic game over belief spaces, since each agent's belief is potentially different.
• E. A. Hansen et al. propose a policy-iteration approach for POSGs that alleviates the scaling issue in the finite-horizon case via iterative elimination of dominated strategies (policies).
Summary
• The theories of MDPs and Markov games are strongly related.
• Minimax-Q learning is a Q-learning scheme proposed for two-player ZS games.
• Minimax-Q is very conservative in its action, since it chooses a strategy that maximizes the worst-case performance of the agent.
• Nash-Q is developed for multi-player, general-sum games but converges only under strict restrictions (existence and uniqueness of NE).
• FFQ relaxes the restrictions (uniqueness) a bit, but not much.
• Most algorithms are reactive, i.e., each agent lets the others choose an equilibrium point and then learns its best response.
• Under partial observability of states, solution methods do not yet scale.