A reinforcement learning scheme for a multi-agent card game: learning a POMDP
Hajime Fujita (1), Yoichiro Matsuno (2), and Shin Ishii (1,3)
1. Nara Institute of Science and Technology
2. Ricoh Co. Ltd.
3. CREST, Japan Science and Technology Corporation
With adaptations by L. Schomaker for KI2
Contents
Introduction
Preparation: the card game "Hearts"; outline of our RL scheme
Proposed method: state transition on the observation state; mean-field approximation; action control; action predictor
Computer simulation results
Summary
Background
Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments:
Black Jack (A. Perez-Uribe and A. Sanchez, 1998)
Othello (T. Yoshioka, S. Ishii and M. Ito, 1999)
Backgammon (G. Tesauro, 1994)
also: the game of Go, graduation project of Reindert-Jan Ekker
These are all completely observable problems. What about partially observable problems? Can we estimate the missing information and predict the environment's behavior?
Research field: reinforcement learning
We seek an RL scheme applicable to a multi-agent environment which is partially observable. The card game "Hearts" (Dutch: Hartenjagen) is a good candidate:
Multi-agent environment (four players) whose objective is well-defined
Partially Observable Markov Decision Process (POMDP): the cards in the opponents' hands are unobservable
Realistic problem: huge state space, a large number of unobservable variables, and a competitive game with four agents
A challenging study.
Card game "Hearts"
Hearts is a 4-player game (multi-agent environment).
Each player has 13 cards at the beginning of the game (partial observability).
Each player plays a card in clockwise order.
Particular cards carry penalty points: the Queen of Spades is worth 13 penalty points, and each Heart is worth 1 penalty point.
Objective: to score as few points as possible. Players must contrive strategies to avoid these penalty cards (competitive situation).
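As a concrete illustration of the scoring rule, here is a minimal sketch in Python; the card encoding and the function name are my own, not from the slides:

```python
# Minimal sketch of the Hearts penalty rule described above.
# A card is a (rank, suit) pair; this encoding is an assumption.

def penalty_points(card):
    """Return the penalty value of a single card."""
    rank, suit = card
    if suit == "spades" and rank == "Q":
        return 13          # the Queen of Spades costs 13 points
    if suit == "hearts":
        return 1           # every heart costs 1 point
    return 0               # all other cards are free

# A player's score for one game is the sum over the cards they collected.
taken = [("Q", "spades"), ("7", "hearts"), ("2", "clubs")]
print(sum(penalty_points(c) for c in taken))  # -> 14
```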
Outline of the RL scheme
The agent (player) predicts its opponents' actions using an acquired environmental model: "The next player will probably not discard a spade. So my best action is …"
Is this computable by brute force? No: the search space is too large, the utility of actions is unknown, and the opponents' strategies are unknown.
Instead, the prediction is made with an acquired environmental model. How? By estimating the unobservable part of the state, by reinforcement learning, and by training on simulated games.
Proposed method
State transition on the observation state; mean-field approximation; action control; action predictor
State transition on the observation state
The state transition on the observation state in the game can be calculated as

P(x_{t+1} | a_t, H_t, \Phi, K) = \sum_{s_t \in S_t} P(s_t | H_t, K) \sum_{(a_t^1, a_t^2, a_t^3)} \prod_{i=1}^{3} P(a_t^i | x_t^i, H_t, \Phi^i, K)

where
x: observation (cards in hand + cards on the table)
a: action (the card to be played)
s: state (all observable and unobservable cards)
\Phi: the strategies of each of the opponents
H_t: the history of all x and a up to time t
K: knowledge of the game
Examples
a: "play the two of hearts"
s: [unobservable part] East holds cards u, v, w, …, z; West holds cards a, b, …; North holds cards r, s, …
[observable part = x] I hold cards f, g, …; on the table lie cards k, l, …
H_t: {{s_0, a_0}_west, {s_1, a_1}_north, …, {s_t, a_t}_east}
State transition on the observation state
In words: the probability of a particular hand and of the cards played at t+1 is the product of {the sum of the probabilities of all possible card distributions, given the history at time t and the game knowledge K} and {the sum over the products of the probabilities of all possible actions of opponents 1-3, each given that opponent's strategy and the history}.
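To make the structure of this nested sum explicit, here is a schematic Python rendering; every parameter (the state enumerator, the probability functions, the outcome map) is a hypothetical placeholder of mine, since the slides do not specify them:

```python
from itertools import product

def transition_prob(x_next, states, prob_state, opp_actions, prob_action, outcome):
    """Exact form of P(x_{t+1} | a_t, H_t, Phi, K) as a nested sum.

    states      : iterable of candidate hidden states s_t
    prob_state  : s_t -> P(s_t | H_t, K)
    opp_actions : s_t -> three lists of legal actions, one per opponent
    prob_action : (i, a_i, s_t) -> P(a_t^i | x_t^i, H_t, Phi^i, K)
    outcome     : (s_t, joint_actions) -> resulting observation x_{t+1}
    """
    total = 0.0
    for s in states:
        p_s = prob_state(s)
        for joint in product(*opp_actions(s)):       # all (a^1, a^2, a^3)
            if outcome(s, joint) != x_next:
                continue                             # keep only terms reaching x_{t+1}
            p_a = 1.0
            for i, a in enumerate(joint, start=1):
                p_a *= prob_action(i, a, s)
            total += p_s * p_a
    return total

# Toy instantiation: two hidden states, one legal action per opponent.
print(transition_prob(
    "x1",
    states=["s_a", "s_b"],
    prob_state=lambda s: 0.5,
    opp_actions=lambda s: [["c1"], ["c2"], ["c3"]],
    prob_action=lambda i, a, s: 1.0,
    outcome=lambda s, joint: "x1" if s == "s_a" else "x2",
))  # -> 0.5
```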
State transition on the observation state
This calculation is intractable: Hearts has a very huge state space, about 10^27 states! The summation over all states ranges over the number of ways to distribute 52 cards over 4 players such that each holds 13 cards. An approximation is needed.
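A quick sanity check of the combinatorics in Python; these numbers count card distributions only, not full game states:

```python
from math import factorial

# Ways to deal 52 cards into four hands of 13 each:
deals_full = factorial(52) // factorial(13) ** 4
print(f"{deals_full:.3e}")   # ~5.4e28

# Ways to distribute the 39 cards one player cannot see over three opponents:
deals_hidden = factorial(39) // factorial(13) ** 3
print(f"{deals_hidden:.3e}") # ~8.4e16
```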
Mean-field approximation
Calculate a mean estimated observation state for each opponent agent. The estimated observation state of opponent i is the probability-weighted sum over its possible observations x_t^i, given the action and the history (and the game knowledge K); these partial probabilities become known during the game:

\hat{y}_t^i(a_t, H_t, K) = \sum_{x_t^i} y_t^i P(x_t^i | a_t, H_t, K)

The transition probability is then approximated as

P(x_{t+1} | a_t, H_t, \Phi, K) \approx \sum_{(a_t^1, a_t^2, a_t^3)} \prod_{i=1}^{3} P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i)

where \hat{y}_t^i is the mean observation state. In this way the conditional probability distribution of an action by opponent i can be determined, i.e., given that opponent's estimated "unobservable state".
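A minimal sketch of this averaging step, assuming observation states are represented as NumPy vectors and that some enumeration of candidate observations with their probabilities is available (both assumptions of mine, not given on the slides):

```python
import numpy as np

def mean_observation_state(candidates):
    """Mean-field estimate: y_hat^i = sum_x y(x) * P(x | a_t, H_t, K).

    `candidates` is an iterable of (y_vector, probability) pairs, one per
    possible observation x_t^i of opponent i.
    """
    y_hat = None
    for y, p in candidates:
        y = np.asarray(y, dtype=float)
        y_hat = p * y if y_hat is None else y_hat + p * y
    return y_hat

# Toy usage: two candidate observation vectors with probabilities 0.7 / 0.3.
print(mean_observation_state([([1.0, 0.0], 0.7), ([0.0, 1.0], 0.3)]))
# -> [0.7 0.3]
```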
Action control: TD reinforcement learning
An action is selected based on the expected TD error

\rho(a_t) = R(x_t, a_t) + \langle V \rangle(x_t, a_t) - V(x_t)

where

\langle V \rangle(x_t, a_t) = \sum_{x_{t+1}} P(x_{t+1} | a_t, H_t, \Phi, K) V(x_{t+1}).

Using the expected TD error, the action selection probability is given by the softmax rule

P(a_t^k | x_t) = \frac{\exp(\rho(a_t^k)/T)}{\sum_{a_t^m \in A} \exp(\rho(a_t^m)/T)}
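A compact sketch of this selection rule, given the expected TD error of each legal action; the helper name and the temperature value are illustrative, not from the slides:

```python
import numpy as np

def select_action(actions, td_errors, temperature=1.0, rng=np.random.default_rng()):
    """Boltzmann (softmax) selection over the expected TD errors rho(a)."""
    logits = np.asarray(td_errors, dtype=float) / temperature
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]

# Toy usage: three legal cards with their expected TD errors.
print(select_action(["2H", "QS", "AC"], [0.1, -0.5, 0.4], temperature=0.5))
```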
Action prediction
We use a function approximator (an NGnet, normalized Gaussian network) for the utility function U^i, which is likely to be non-linear. The function approximators can be trained on past games. The opponent action probabilities that enter the approximated transition

P(x_{t+1} | a_t, H_t, \Phi, K) \approx \sum_{(a_t^1, a_t^2, a_t^3)} \prod_{i=1}^{3} P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i)

are modeled as

P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i) = \frac{\exp(U^i(\hat{y}_t^i(a_t^i, a_t, H_t, K))/T^i)}{\sum_{a^i \in A} \exp(U^i(\hat{y}_t^i(a^i, a_t, H_t, K))/T^i)}
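The same softmax form drives the opponent model. Below is a sketch in which a generic regressor stands in for the NGnet utility U^i; the slides use an NGnet specifically, but any function approximator trainable on past games fits the same slot, and all names here are illustrative:

```python
import numpy as np

def predict_opponent_action(y_hat_per_action, utility, temperature=1.0):
    """P(a^i | y_hat, Phi_hat^i) = softmax over U^i(y_hat(a^i)) / T.

    `y_hat_per_action` maps each legal action a^i of opponent i to the
    mean observation state that would result; `utility` approximates U^i
    (an NGnet in the paper, any regressor in this sketch).
    """
    actions = list(y_hat_per_action)
    u = np.array([utility(y_hat_per_action[a]) for a in actions]) / temperature
    u -= u.max()                          # numerical stability
    p = np.exp(u)
    p /= p.sum()
    return dict(zip(actions, p))

# Toy usage with a linear stand-in for the learned utility function.
toy_utility = lambda y: float(np.dot(y, [0.5, -0.2]))
print(predict_opponent_action({"2H": np.array([1.0, 0.0]),
                               "QS": np.array([0.0, 1.0])}, toy_utility))
```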
Summary of proposed method
An RL scheme based on: estimation of the unobservable state variables (by mean-field approximation) and prediction of the opponent agents' actions, P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i). The learning agent determines its action based on its prediction of the environment's behavior.
[Diagram: architecture of the proposed method]
Computer simulations
Three experiments to evaluate the learning agent against a rule-based agent:
(A) Single-agent learning in a stationary environment: one learning agent and three rule-based agents
(B) Learning by multiple agents in a multi-agent environment: one learning agent, one actor-critic agent, and two rule-based agents
(C) Learning by multiple agents in a multi-agent environment: two learning agents and two rule-based agents
The rule-based agent has more than 50 rules and is an "experienced"-level Hearts player.
[Figure: experiment (A), average penalty ratio vs. number of games for the proposed RL agent and three rule-based agents; a lower penalty ratio means a better player.]
[Figure: experiment (B), average penalty ratio vs. number of games for the proposed RL agent, an actor-critic agent, and two rule-based agents.]
[Figure: experiment (C), average penalty ratio vs. number of games for two proposed RL agents and two rule-based agents.]
Summary
We proposed an RL scheme for building an autonomous learning agent that plays the multi-player card game "Hearts".
Our RL agent estimates the unobservable state variables with a mean-field approximation, and learns and predicts the environment's behavior.
Computer simulations showed that our method is applicable to a realistic multi-agent problem.
Nara Institute of Science and Technology (NAIST)
Hajime Fujita
[email protected]
http://hawaii.aist-nara.ac.jp/~hajime-f/