A reinforcement learning scheme for a multi-agent card game: learning a POMDP
Hajime Fujita (1), Yoichiro Matsuno (2), and Shin Ishii (1,3)
1. Nara Institute of Science and Technology
2. Ricoh Co. Ltd.
3. CREST, Japan Science and Technology Corporation
With adaptations by L. Schomaker for KI2
Contents
Introduction
Preparation: the card game "Hearts"; outline of our RL scheme
Proposed method: state transition on the observation state; mean-field approximation; action control; action predictor
Computer simulation results
Summary
Background
Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments:
Black Jack (A. Perez-Uribe and A. Sanchez, 1998)
Othello (T. Yoshioka, S. Ishii and M. Ito, 1999)
Backgammon (G. Tesauro, 1994)
also: the game of Go, graduation project of Reindert-Jan Ekker
These are all completely observable problems. What about partially observable problems? Can we estimate the missing information and predict the environment's behavior?
Research field: reinforcement learning
We seek an RL scheme applicable to a multi-agent environment which is partially observable. The card game "Hearts" (Dutch: Hartenjagen) is a good candidate:
Multi-agent environment (four players) whose objective is well-defined
Partially Observable Markov Decision Process (POMDP): the cards in the opponents' hands are unobservable
Realistic problem: huge state space, a large number of unobservable variables, and a competitive game with four agents
A challenging study.
Card game "Hearts"
Hearts is a 4-player game (multi-agent environment).
Each player has 13 cards at the beginning of the game (partial observability).
Each player plays a card in clockwise order.
Particular cards carry penalty points: the Queen of Spades is worth 13 penalty points, and each Heart is worth 1 penalty point.
Objective: to score as few points as possible. Players must contrive strategies to avoid these penalty cards (competitive situation).
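As a concrete illustration of the scoring rule, here is a minimal sketch in Python; the card encoding and the function name are my own, not from the slides:

```python
# Minimal sketch of the Hearts penalty rule described above.
# A card is a (rank, suit) pair; this encoding is an assumption.

def penalty_points(card):
    """Return the penalty value of a single card."""
    rank, suit = card
    if suit == "spades" and rank == "Q":
        return 13          # the Queen of Spades costs 13 points
    if suit == "hearts":
        return 1           # every heart costs 1 point
    return 0               # all other cards are free

# A player's score for one game is the sum over the cards they collected.
taken = [("Q", "spades"), ("7", "hearts"), ("2", "clubs")]
print(sum(penalty_points(c) for c in taken))  # -> 14
```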
Outline of the RL scheme
The agent (player) predicts its opponents' actions using an acquired environmental model: "The next player will probably not discard a spade. So my best action is …"
Is this computable by brute force? No: the search space is too large, the utility of actions is unknown, and the opponents' strategies are unknown.
Instead, the prediction is made with an acquired environmental model. How? By estimating the unobservable part of the state, by reinforcement learning, and by training on simulated games.
Proposed method
State transition on the observation state; mean-field approximation; action control; action predictor
State transition on the observation state
The state transition on the observation state in the game can be calculated as

P(x_{t+1} | a_t, H_t, \Phi, K) = \sum_{s_t \in S_t} P(s_t | H_t, K) \sum_{(a_t^1, a_t^2, a_t^3)} \prod_{i=1}^{3} P(a_t^i | x_t^i, H_t, \Phi^i, K)

where
x: observation (cards in hand + cards on the table)
a: action (the card to be played)
s: state (all observable and unobservable cards)
\Phi: the strategies of each of the opponents
H_t: the history of all x and a up to time t
K: knowledge of the game
Examples
a: "play the two of hearts"
s: [unobservable part] East holds cards u, v, w, …, z; West holds cards a, b, …; North holds cards r, s, …
[observable part = x] I hold cards f, g, …; on the table lie cards k, l, …
H_t: {{s_0, a_0}_west, {s_1, a_1}_north, …, {s_t, a_t}_east}
State transition on the observation state
In words: the probability of a particular hand and of the cards played at t+1 is the product of {the sum of the probabilities of all possible card distributions, given the history at time t and the game knowledge K} and {the sum over the products of the probabilities of all possible actions of opponents 1-3, each given that opponent's strategy and the history}.
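To make the structure of this nested sum explicit, here is a schematic Python rendering; every parameter (the state enumerator, the probability functions, the outcome map) is a hypothetical placeholder of mine, since the slides do not specify them:

```python
from itertools import product

def transition_prob(x_next, states, prob_state, opp_actions, prob_action, outcome):
    """Exact form of P(x_{t+1} | a_t, H_t, Phi, K) as a nested sum.

    states      : iterable of candidate hidden states s_t
    prob_state  : s_t -> P(s_t | H_t, K)
    opp_actions : s_t -> three lists of legal actions, one per opponent
    prob_action : (i, a_i, s_t) -> P(a_t^i | x_t^i, H_t, Phi^i, K)
    outcome     : (s_t, joint_actions) -> resulting observation x_{t+1}
    """
    total = 0.0
    for s in states:
        p_s = prob_state(s)
        for joint in product(*opp_actions(s)):       # all (a^1, a^2, a^3)
            if outcome(s, joint) != x_next:
                continue                             # keep only terms reaching x_{t+1}
            p_a = 1.0
            for i, a in enumerate(joint, start=1):
                p_a *= prob_action(i, a, s)
            total += p_s * p_a
    return total

# Toy instantiation: two hidden states, one legal action per opponent.
print(transition_prob(
    "x1",
    states=["s_a", "s_b"],
    prob_state=lambda s: 0.5,
    opp_actions=lambda s: [["c1"], ["c2"], ["c3"]],
    prob_action=lambda i, a, s: 1.0,
    outcome=lambda s, joint: "x1" if s == "s_a" else "x2",
))  # -> 0.5
```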
State transition on the observation state
This calculation is intractable: Hearts has a very huge state space, about 10^27 states! The summation over all states ranges over the number of ways to distribute 52 cards over 4 players such that each holds 13 cards. An approximation is needed.
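A quick sanity check of the combinatorics in Python; these numbers count card distributions only, not full game states:

```python
from math import factorial

# Ways to deal 52 cards into four hands of 13 each:
deals_full = factorial(52) // factorial(13) ** 4
print(f"{deals_full:.3e}")   # ~5.4e28

# Ways to distribute the 39 cards one player cannot see over three opponents:
deals_hidden = factorial(39) // factorial(13) ** 3
print(f"{deals_hidden:.3e}") # ~8.4e16
```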
Mean-field approximation
Calculate a mean estimated observation state for each opponent agent. The estimated observation state of opponent i is the probability-weighted sum over its possible observations x_t^i, given the action and the history (and the game knowledge K); these partial probabilities become known during the game:

\hat{y}_t^i(a_t, H_t, K) = \sum_{x_t^i} y_t^i P(x_t^i | a_t, H_t, K)

The transition probability is then approximated as

P(x_{t+1} | a_t, H_t, \Phi, K) \approx \sum_{(a_t^1, a_t^2, a_t^3)} \prod_{i=1}^{3} P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i)

where \hat{y}_t^i is the mean observation state. In this way the conditional probability distribution of an action by opponent i can be determined, i.e., given that opponent's estimated "unobservable state".
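A minimal sketch of this averaging step, assuming observation states are represented as NumPy vectors and that some enumeration of candidate observations with their probabilities is available (both assumptions of mine, not given on the slides):

```python
import numpy as np

def mean_observation_state(candidates):
    """Mean-field estimate: y_hat^i = sum_x y(x) * P(x | a_t, H_t, K).

    `candidates` is an iterable of (y_vector, probability) pairs, one per
    possible observation x_t^i of opponent i.
    """
    y_hat = None
    for y, p in candidates:
        y = np.asarray(y, dtype=float)
        y_hat = p * y if y_hat is None else y_hat + p * y
    return y_hat

# Toy usage: two candidate observation vectors with probabilities 0.7 / 0.3.
print(mean_observation_state([([1.0, 0.0], 0.7), ([0.0, 1.0], 0.3)]))
# -> [0.7 0.3]
```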
Action control: TD reinforcement learning
An action is selected based on the expected TD error

\rho(a_t) = R(x_t, a_t) + \langle V \rangle(x_t, a_t) - V(x_t)

where

\langle V \rangle(x_t, a_t) = \sum_{x_{t+1}} P(x_{t+1} | a_t, H_t, \Phi, K) V(x_{t+1}).

Using the expected TD error, the action selection probability is given by the softmax rule

P(a_t^k | x_t) = \frac{\exp(\rho(a_t^k)/T)}{\sum_{a_t^m \in A} \exp(\rho(a_t^m)/T)}
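A compact sketch of this selection rule, given the expected TD error of each legal action; the helper name and the temperature value are illustrative, not from the slides:

```python
import numpy as np

def select_action(actions, td_errors, temperature=1.0, rng=np.random.default_rng()):
    """Boltzmann (softmax) selection over the expected TD errors rho(a)."""
    logits = np.asarray(td_errors, dtype=float) / temperature
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]

# Toy usage: three legal cards with their expected TD errors.
print(select_action(["2H", "QS", "AC"], [0.1, -0.5, 0.4], temperature=0.5))
```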
Action prediction
We use a function approximator (an NGnet, normalized Gaussian network) for the utility function U^i, which is likely to be non-linear. The function approximators can be trained on past games. The opponent action probabilities that enter the approximated transition

P(x_{t+1} | a_t, H_t, \Phi, K) \approx \sum_{(a_t^1, a_t^2, a_t^3)} \prod_{i=1}^{3} P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i)

are modeled as

P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i) = \frac{\exp(U^i(\hat{y}_t^i(a_t^i, a_t, H_t, K))/T^i)}{\sum_{a^i \in A} \exp(U^i(\hat{y}_t^i(a^i, a_t, H_t, K))/T^i)}
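The same softmax form drives the opponent model. Below is a sketch in which a generic regressor stands in for the NGnet utility U^i; the slides use an NGnet specifically, but any function approximator trainable on past games fits the same slot, and all names here are illustrative:

```python
import numpy as np

def predict_opponent_action(y_hat_per_action, utility, temperature=1.0):
    """P(a^i | y_hat, Phi_hat^i) = softmax over U^i(y_hat(a^i)) / T.

    `y_hat_per_action` maps each legal action a^i of opponent i to the
    mean observation state that would result; `utility` approximates U^i
    (an NGnet in the paper, any regressor in this sketch).
    """
    actions = list(y_hat_per_action)
    u = np.array([utility(y_hat_per_action[a]) for a in actions]) / temperature
    u -= u.max()                          # numerical stability
    p = np.exp(u)
    p /= p.sum()
    return dict(zip(actions, p))

# Toy usage with a linear stand-in for the learned utility function.
toy_utility = lambda y: float(np.dot(y, [0.5, -0.2]))
print(predict_opponent_action({"2H": np.array([1.0, 0.0]),
                               "QS": np.array([0.0, 1.0])}, toy_utility))
```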
Summary of proposed method
An RL scheme based on: estimation of the unobservable state variables (by mean-field approximation) and prediction of the opponent agents' actions, P(a_t^i | \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i). The learning agent determines its action based on its prediction of the environment's behavior.
[Diagram: architecture of the proposed method]
Computer simulations
Three experiments to evaluate the learning agent against a rule-based agent:
(A) Single-agent learning in a stationary environment: one learning agent and three rule-based agents
(B) Learning by multiple agents in a multi-agent environment: one learning agent, one actor-critic agent, and two rule-based agents
(C) Learning by multiple agents in a multi-agent environment: two learning agents and two rule-based agents
The rule-based agent has more than 50 rules and is an "experienced"-level Hearts player.
[Figure: experiment (A), average penalty ratio vs. number of games for the proposed RL agent and three rule-based agents; a lower penalty ratio means a better player.]
[Figure: experiment (B), average penalty ratio vs. number of games for the proposed RL agent, an actor-critic agent, and two rule-based agents.]
[Figure: experiment (C), average penalty ratio vs. number of games for two proposed RL agents and two rule-based agents.]
Summary
We proposed an RL scheme for building an autonomous learning agent that plays the multi-player card game "Hearts".
Our RL agent estimates the unobservable state variables with a mean-field approximation, and learns and predicts the environment's behavior.
Computer simulations showed that our method is applicable to a realistic multi-agent problem.
Nara Institute of Science and Technology (NAIST)
Hajime Fujita
[email protected]
http://hawaii.aist-nara.ac.jp/~hajime-f/