A reinforcement learning scheme for a multi-agent card game: learning a POMDP


Transcript of A reinforcement learning scheme for a multi-agent card game: learning a POMDP

Page 1: A reinforcement learning scheme for a multi-agent card game: learning a POMDP

A reinforcement learning scheme for a multi-agent card game: learning a POMDP

Hajime Fujita (1), Yoichiro Matsuno (2), and Shin Ishii (1,3)
1. Nara Institute of Science and Technology
2. Ricoh Co. Ltd.
3. CREST, Japan Science and Technology Corporation

Adapted by L. Schomaker for the KI2 course

Page 2: Contents

- Introduction
- Preparation
  - Card game "Hearts"
  - Outline of our RL scheme
- Proposed method
  - State transition on the observation state
  - Mean-field approximation
  - Action control
  - Action predictor
- Computer simulation results
- Summary

Page 3: Background

Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments:

- Black Jack (A. Perez-Uribe and A. Sanchez, 1998)
- Othello (T. Yoshioka, S. Ishii and M. Ito, 1999)
- Backgammon (G. Tesauro, 1994)
- also: the game of Go, graduation project of Reindert-Jan Ekker

These are all completely observable problems.

Page 4: Background (continued)

The games above are all completely observable problems. What about partially observable problems? Can we estimate missing information? Can we predict environmental behaviors?

Page 5: Research field: Reinforcement Learning

Goal: an RL scheme applicable to a multi-agent environment which is partially observable. Test case: the card game "Hearts" (Dutch: Hartenjagen).

- Multi-agent (four players) environment
- The objective is well-defined
- Partially Observable Markov Decision Process (POMDP): cards in opponents' hands are unobservable
- Realistic problem: huge state space, large number of unobservable variables, competitive game with four agents

A challenging study.

Page 6: Card game "Hearts"

- Hearts is a 4-player game (multi-agent environment).
- Each player has 13 cards at the beginning of the game (partially observable).
- Each player plays a card clockwise.
- Particular cards carry penalty points.
- Object: to score as few points as possible.
- Players must contrive strategies to avoid these penalty cards (competitive situation).

[Card images: the queen of spades carries 13 penalty points; each heart carries 1 penalty point.]
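The penalty rule is compact enough to state as code. A minimal sketch, with a `(rank, suit)` card encoding of our own choosing (not from the slides):

```python
QUEEN_OF_SPADES = (12, "spades")  # queen encoded as rank 12

def penalty(card):
    """Penalty points for one card: 1 per heart, 13 for the queen of spades."""
    _, suit = card
    if suit == "hearts":
        return 1
    if card == QUEEN_OF_SPADES:
        return 13
    return 0

def trick_penalty(trick):
    """Total penalty collected by the winner of a trick (a list of 4 cards)."""
    return sum(penalty(c) for c in trick)

# Two hearts plus the queen of spades cost 1 + 1 + 13 = 15 points.
assert trick_penalty([(5, "hearts"), (12, "spades"), (9, "hearts"), (3, "clubs")]) == 15
```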

Page 7: Outline of learning scheme

The agent (player) predicts opponents' actions using an acquired environmental model.

Thought bubble: "The next player will probably not discard a spade. So my best action is …"

Page 8: Outline of learning scheme

The agent (player) predicts opponents' actions using an acquired environmental model. Is this computable by brute force?

Page 9: Outline of learning scheme

Is this computable by brute force? No!

- size of the search space
- unknown utility of actions
- unknown opponent strategies

Page 10: Outline of the RL scheme

The agent (player) predicts opponents' actions; the prediction is made using the acquired environmental model.

Page 11: Outline of our RL scheme

The agent (player) predicts opponents' actions using the acquired environmental model. How? By estimating the unobservable part of the state, by reinforcement learning, and by training on simulated games. (A sketch of one decision step follows below.)
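As a rough structural sketch only: one decision step combines the three ingredients just listed. The three callables passed in below are hypothetical placeholders for the components that the following slides develop (mean-field state estimation, the action predictor, and the expected TD error used for action control):

```python
import math
import random

def choose_action(observation, history, legal_actions,
                  estimate_belief, predict_opponents, expected_td_error,
                  temperature=1.0):
    """One decision step: estimate -> predict -> score -> soft-max select."""
    # 1. Estimate the unobservable part of the state (opponents' hands).
    belief = estimate_belief(observation, history)
    scores = []
    for a in legal_actions:
        # 2. Predict the opponents' responses to this candidate action.
        predicted = predict_opponents(a, belief, history)
        # 3. Score the action by its expected TD error.
        scores.append(expected_td_error(observation, a, predicted))
    # 4. Boltzmann (soft-max) selection over the expected TD errors.
    weights = [math.exp(s / temperature) for s in scores]
    return random.choices(legal_actions, weights=weights, k=1)[0]
```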

Page 12: Proposed method

- State transition on the observation state
- Mean-field approximation
- Action control
- Action predictor

Page 13: State transition on the observation state

The state transition on the observation state in the game can be calculated by:

$$P(x_{t+1} \mid a_t, H_t, K) = \sum_{s_t \in S_t} P(s_t \mid H_t, K) \sum_{(a_t^1, a_t^2, a_t^3) \in S_{a_t}} \prod_{i=1}^{3} P(a_t^i \mid x_t^i, \Phi^i, H_t, K)$$

Page 14: State transition on the observation state

In $P(x_{t+1} \mid a_t, H_t, K)$ and the equation above:

- x: observation (cards in hand + cards on the table)
- a: action (card to be played)
- s: state (all observable and unobservable cards)
- Φ: strategies of each of the opponents
- H_t: history of all x and a up to time t
- K: knowledge of the game

Page 15: Examples

a: "play the two of hearts"

s:
- [unobservable part] East holds cards u,v,w,…,z; West holds cards a,b,…; North holds cards r,s,…
- [observable part = x] I hold cards f,g,…; on the table lie cards k,l,…

H_t: {{s0,a0}_west, {s1,a1}_north, …, {st,at}_east}

Page 16: State transition on the observation state

$$P(x_{t+1} \mid a_t, H_t, K) = \sum_{s_t \in S_t} P(s_t \mid H_t, K) \sum_{(a_t^1, a_t^2, a_t^3) \in S_{a_t}} \prod_{i=1}^{3} P(a_t^i \mid x_t^i, \Phi^i, H_t, K)$$

In words: the probability of a particular hand and played cards at t+1 is the product of {the sum of the probabilities of all possible card distributions, given the history up to t and game knowledge K} and {the sum over the products of the probabilities of all possible actions of opponents 1-3, given each opponent's strategy and the history}.
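Read literally, the formula is a double enumeration: over candidate full states and over opponent-action triples. A toy-scale sketch of that enumeration, with hypothetical callables standing in for the probability models (the next slides explain why this is hopeless at full scale):

```python
from itertools import product

def transition_prob(x_next, a_t, candidate_states, state_prob,
                    opp_action_prob, consistent, opp_actions=range(13)):
    """Brute-force P(x_{t+1} | a_t, H_t, K), mirroring the sum-product above.

    candidate_states: iterable of full states s_t compatible with the history
    state_prob(s):    P(s_t | H_t, K)
    opp_action_prob(i, a_i, s): P(a_t^i | x_t^i, Phi^i, H_t, K) for opponent i
    consistent(x_next, s, a_t, acts): does this action triple yield x_{t+1}?
    """
    total = 0.0
    for s in candidate_states:
        p_s = state_prob(s)
        for acts in product(opp_actions, repeat=3):   # (a_t^1, a_t^2, a_t^3)
            if not consistent(x_next, s, a_t, acts):
                continue
            p_a = 1.0
            for i, a_i in enumerate(acts):
                p_a *= opp_action_prob(i, a_i, s)     # per-opponent factor
            total += p_s * p_a
    return total
```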

Page 18: State transition on the observation state

$$P(x_{t+1} \mid a_t, H_t, K) = \sum_{s_t \in S_t} P(s_t \mid H_t, K) \sum_{(a_t^1, a_t^2, a_t^3) \in S_{a_t}} \prod_{i=1}^{3} P(a_t^i \mid x_t^i, \Phi^i, H_t, K)$$

The calculation is intractable: Hearts has a very large state space, about $10^{27}$ states! The first summation runs over all of them, so an approximation is needed.

Page 19: State transition on the observation state

The summation over all states is the bottleneck: it ranges over all ways to distribute the 52 cards over the 4 players such that each player holds 13 cards. (A quick check of that combinatorial count follows below.)
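To make the note concrete, the deal count it describes can be computed directly:

```python
from math import comb

# Distribute 52 cards over 4 players, 13 each: choose 13 for the first
# player, 13 of the remaining 39 for the second, 13 of the remaining 26
# for the third; the last hand is then forced.
deals = comb(52, 13) * comb(39, 13) * comb(26, 13)
print(f"{deals:.3e}")  # ~5.364e+28 possible initial deals
```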

Page 20: Mean-field approximation

Calculate a mean estimated observation state for each opponent agent:

$$\hat{y}_t^i(a_t, H_t, K) = \sum_{x_t^i} y_t^i \, P(x_t^i \mid a_t, H_t, K)$$

The estimated observation state of opponent i is a weighted sum over the possible observations $x_t^i$, weighted by their probabilities given the action, the history, and game knowledge K. These component probabilities become known as the game progresses. (A small numeric sketch follows below.)
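A minimal numeric sketch of that weighted average, under the assumption that each candidate observation is encoded as a vector (e.g. 0/1 card-possession flags); the encoding is our illustration, not the paper's:

```python
import numpy as np

def mean_observation_state(candidate_obs, probs):
    """Mean-field estimate: probability-weighted average of the candidate
    observation vectors of one opponent.

    candidate_obs: (n_candidates, dim) array; each row encodes one possible
        observation x_t^i of opponent i
    probs: (n_candidates,) array of P(x_t^i | a_t, H_t, K)
    """
    candidate_obs = np.asarray(candidate_obs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()        # normalise defensively
    return probs @ candidate_obs       # \hat{y}_t^i = sum_x y(x) P(x | ...)

# Two equally likely 3-card encodings average to per-card probabilities.
y_hat = mean_observation_state([[1, 0, 1], [0, 1, 1]], [0.5, 0.5])
# y_hat == array([0.5, 0.5, 1. ])
```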

Page 21: Mean-field approximation

Calculate a mean estimated observation state for each opponent agent:

$$\hat{y}_t^i(a_t, H_t, K) = \sum_{x_t^i} y_t^i \, P(x_t^i \mid a_t, H_t, K)$$

The transition probability is then approximated as

$$P(x_{t+1} \mid a_t, H_t, K) \approx \sum_{(a_t^1, a_t^2, a_t^3) \in S_{a_t}} \prod_{i=1}^{3} P\big(a_t^i \mid \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i\big)$$

with $\hat{y}_t^i$ the mean observation state.

Page 22: Mean-field approximation

With the mean observation state in place, the conditional probability distribution of an action by opponent i can be determined given that opponent's estimated "unobservable state": this is the factor $P\big(a_t^i \mid \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i\big)$ inside the approximated transition probability above.

Page 23: Action control: TD reinforcement learning

An action is selected based on the expected TD error

$$\delta(x_t, a_t) = R(x_t, a_t) + \bar{V}(x_t, a_t) - V(x_t)$$

where, for any function f of the next observation state,

$$\bar{f}(x_t, a_t) = \sum_{x_{t+1}} P(x_{t+1} \mid a_t, H_t, K) \, f(x_{t+1}).$$

Using the expected TD error, the action selection probability is given by the soft-max rule

$$P(a_t^k \mid x_t) = \frac{\exp\big(\delta(x_t, a_t^k)/T\big)}{\sum_{a_t^m \in A} \exp\big(\delta(x_t, a_t^m)/T\big)}.$$
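A minimal sketch of these two formulas, assuming the next-observation distribution has already been computed (e.g. with the mean-field approximation above); the names are illustrative:

```python
import math

def expected_td_error(R, V, x_t, a_t, next_obs_probs):
    """delta(x_t, a_t) = R(x_t, a_t) + E[V(x_{t+1})] - V(x_t).

    next_obs_probs: dict {x_next: P(x_{t+1} | a_t, H_t, K)}; the expectation
    implements the f-bar sum on the slide with f = V.  (No explicit discount
    factor appears on the slide; add one if your formulation uses gamma.)
    """
    expected_v = sum(p * V(x1) for x1, p in next_obs_probs.items())
    return R(x_t, a_t) + expected_v - V(x_t)

def action_probabilities(deltas, T=1.0):
    """Soft-max selection probabilities over per-action expected TD errors."""
    zs = {a: math.exp(d / T) for a, d in deltas.items()}
    z_total = sum(zs.values())
    return {a: z / z_total for a, z in zs.items()}
```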

Page 24: Action prediction

We use a function approximator (an NGnet, normalized Gaussian network) for the utility function $U^i$, which is likely to be non-linear. Function approximators can be trained using past games.

The predicted action distribution of opponent i is a soft-max over utilities,

$$P\big(a_t^i \mid \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i\big) = \frac{\exp\big(U^i(\hat{y}_t^i(a_t, H_t, K), a_t^i)/T^i\big)}{\sum_{a^i \in A^i} \exp\big(U^i(\hat{y}_t^i(a_t, H_t, K), a^i)/T^i\big)}$$

which plugs into the approximated transition probability

$$P(x_{t+1} \mid a_t, H_t, K) \approx \sum_{(a_t^1, a_t^2, a_t^3) \in S_{a_t}} \prod_{i=1}^{3} P\big(a_t^i \mid \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i\big).$$
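A sketch of the predictor's soft-max, with `utility` standing in for the trained NGnet $U^i$; any regressor mapping a mean observation state and an action to a scalar would fit this slot:

```python
import math

def predict_action_probs(utility, y_hat, legal_actions, T=1.0):
    """P(a^i | y_hat^i) proportional to exp(U^i(y_hat^i, a^i) / T)."""
    zs = {a: math.exp(utility(y_hat, a) / T) for a in legal_actions}
    total = sum(zs.values())
    return {a: z / total for a, z in zs.items()}
```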

Page 25: Summary of proposed method

An RL scheme based on:
- estimation of unobservable state variables by mean-field approximation
- prediction of opponent agents' actions, $P\big(a_t^i \mid \hat{y}_t^i, \hat{\Phi}^i\big)$

The learning agent determines its action based on the predicted behavior of the environment.

Page 26: Computer simulations

- Rule-based agent
- Single agent learning in a stationary environment
- Learning by multiple agents in a multi-agent environment

Page 27: Computer simulations

Three experiments evaluate the learning agent against a rule-based agent.

Single-agent learning in a stationary environment:
- (A) 1 learning agent + 3 rule-based agents

Learning by multiple agents in a multi-agent environment:
- (B) 1 learning agent + 1 actor-critic agent + 2 rule-based agents
- (C) 2 learning agents + 2 rule-based agents

A rule-based agent has more than 50 rules and plays at the level of an "experienced" Hearts player. (A sketch of the evaluation measure used in the following plots appears below.)
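The result plots report an average penalty ratio against the number of games played. A minimal sketch, under the assumption that the ratio is an agent's share of each game's penalty points, averaged over a window of recent games (our reading of the axis label, not a definition from the slides):

```python
def average_penalty_ratio(per_game_points, window=100):
    """Moving average of one agent's share of the penalty points per game.

    per_game_points: list of (agent_points, total_points) tuples; in Hearts
    every game distributes 26 penalty points (13 hearts + the queen of
    spades).  A lower ratio means a better player.
    """
    recent = per_game_points[-window:]
    return sum(a / t for a, t in recent) / len(recent)

# An agent taking 5 of 26 points every game scores ~0.19, below the 0.25
# expected of four equally strong players.
print(average_penalty_ratio([(5, 26)] * 200))
```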

Page 28

[Figure: average penalty ratio vs. number of games for the proposed RL agent against three rule-based agents; a lower ratio means a better player.]

Page 29

[Figure: average penalty ratio vs. number of games for the proposed RL agent and an actor-critic agent against two rule-based agents; a lower ratio means a better player.]

Page 30

[Figure: average penalty ratio vs. number of games for two proposed RL agents against two rule-based agents; a lower ratio means a better player.]

Page 31: Summary

- We proposed an RL scheme for building an autonomous learning agent that plays the multi-player card game "Hearts".
- Our RL agent estimates unobservable state variables using a mean-field approximation, and learns and predicts environmental behaviors.
- Computer simulations showed that our method is applicable to a realistic multi-agent problem.

Page 32

Nara Institute of Science and Technology (NAIST)
Hajime FUJITA
[email protected]
http://hawaii.aist-nara.ac.jp/~hajime-f/