A reinforcement learning scheme for a multi-agent card game: learning a POMDP


Hajime Fujita (1,3), Yoichiro Matsuno (2), and Shin Ishii (1)
1. Nara Institute of Science and Technology
2. Ricoh Co. Ltd.
3. CREST, Japan Science and Technology Corporation

With adaptations by L. Schomaker for the KI2 course



Contents

- Introduction
- Preparation
  - Card game "Hearts"
  - Outline of our RL scheme
- Proposed method
  - State transition on the observation state
  - Mean-field approximation
  - Action control
  - Action predictor
- Computer simulation results
- Summary


Background

Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments:
- Black Jack (A. Perez-Uribe and A. Sanchez, 1998)
- Othello (T. Yoshioka, S. Ishii and M. Ito, 1999)
- Backgammon (G. Tesauro, 1994)
- also: the game of Go, graduation project by Reindert-Jan Ekker

These are completely observable problems. What about partially observable problems? Can we estimate the missing information and predict the environment's behavior?

Research field: Reinforcement Learning

Goal: an RL scheme applicable to a multi-agent environment that is partially observable. Test case: the card game "Hearts" (Dutch: Hartenjagen).
- Multi-agent (four-player) environment with a well-defined objective
- Partially Observable Markov Decision Process (POMDP): the cards in the opponents' hands are unobservable
- Realistic problem: a huge state space, a large number of unobservable variables, and a competitive game with four agents

A challenging study.


Card game "Hearts"

- Hearts is a 4-player game (multi-agent environment)
- Each player holds 13 cards at the beginning of the game (partially observable)
- Each player plays a card clockwise; particular cards carry penalty points (the queen of spades counts 13 penalty points, each heart counts 1 penalty point)
- Objective: to score as few points as possible; players must contrive strategies to avoid these penalty cards (competitive situation)
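For concreteness, here is a minimal sketch of this penalty structure in Python, assuming the standard Hearts scoring described above; the card encoding and function names are illustrative, not from the paper.

```python
# Minimal sketch of the Hearts penalty structure described above.
# Assumes standard scoring: each heart = 1 penalty point, queen of spades = 13.
# Card encoding (suit, rank) with rank 12 = queen is an illustrative choice.
from typing import Iterable, Tuple

Card = Tuple[str, int]  # (suit, rank), e.g. ("spades", 12) for the queen of spades

def penalty(card: Card) -> int:
    """Return the penalty points carried by a single card."""
    suit, rank = card
    if suit == "hearts":
        return 1
    if suit == "spades" and rank == 12:  # queen of spades
        return 13
    return 0

def trick_penalty(trick: Iterable[Card]) -> int:
    """Total penalty points collected by the winner of one trick."""
    return sum(penalty(c) for c in trick)

# Example: a trick with two hearts and the queen of spades costs 15 points.
assert trick_penalty([("hearts", 5), ("spades", 12), ("hearts", 10), ("clubs", 2)]) == 15
```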


Outline of the RL scheme

The agent (player) predicts its opponents' actions using an acquired environmental model, and reasons along the lines of: "The next player will probably not discard a spade, so my best action is …"

Is this computable by brute force? No:
- the size of the search space is prohibitive
- the utility of actions is unknown
- the opponents' strategies are unknown

Instead, opponent behavior is predicted with an environmental model that is acquired by estimating the unobservable part of the state, by reinforcement learning, and by training on simulated games.

Proposed method

- State transition on the observation state
- Mean-field approximation
- Action control
- Action predictor


State transition on the observation state

The state transition on the observation state in the game can be calculated by

P(x_{t+1} \mid a_t, H_t, K) = \sum_{s_t \in S} P(s_t \mid H_t, K) \sum_{(a_t^1, a_t^2, a_t^3) \in S_{t+1}} \prod_{i=1}^{3} P(a_t^i \mid x_t^i, \Phi^i, H_t, K)

where
x : observation (cards in hand + cards on the table)
a : action (card to be played)
s : state (all observable and unobservable cards)
\Phi^i : strategy of opponent i
H_t : history of all x and a up to time t
K : knowledge of the game
and the inner sum runs over the possible action triples of the three opponents.

Examples

a: "play the two of hearts"
s: [unobservable part] East holds cards u,v,w,…,z; West holds cards a,b,…; North holds cards r,s,…
   [observable part = x] I hold cards f,g,…; on the table lie cards k,l,…
H_t: {{s0,a0}_west, {s1,a1}_north, …, {st,at}_east}

In words: the probability of a particular hand and of the cards played at t+1 is the product of {the sum of the probabilities of all possible card distributions, given the history up to t and the game knowledge K} and {the sum, over all possible action triples of opponents 1-3, of the product of each opponent's action probability, given its strategy and the history}. A toy enumeration of this computation is sketched below; why the exact computation is intractable is discussed next.
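The sum-of-products structure can be made concrete with a toy enumeration (a sketch only, over an artificially tiny set of candidate deals; names such as `transition_prob`, `belief`, and `opponent_policy` are illustrative and not from the paper):

```python
# Toy enumeration of P(x_{t+1} | a_t, H_t, K):
#   sum over candidate hidden states s, weighted by the belief P(s | H_t, K),
#   of the sum over opponent action triples (a^1, a^2, a^3) that are consistent
#   with the next observation, of the product of the opponents' action probabilities.
from itertools import product
from math import prod

def transition_prob(belief, opponent_actions, opponent_policy, consistent):
    total = 0.0
    for s, p_s in belief.items():                          # candidate hidden deals
        for joint in product(*opponent_actions):           # all (a^1, a^2, a^3) triples
            if consistent(s, joint):                       # keep triples yielding x_{t+1}
                total += p_s * prod(opponent_policy(i, s, a) for i, a in enumerate(joint))
    return total

# Tiny example: two candidate deals, one legal card per opponent, uniform policies.
belief = {"deal_A": 0.7, "deal_B": 0.3}
p = transition_prob(belief, [["c1"], ["c2"], ["c3"]],
                    lambda i, s, a: 0.5, lambda s, joint: True)  # -> 0.125
```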

State transition on the observation state: intractability

The exact calculation is intractable: Hearts has a very huge state space, about 10^27 states. The summation over all states s_t runs over the number of ways to distribute 52 cards over 4 players such that each holds 13 cards. An approximation is therefore needed.
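A quick back-of-the-envelope check of that count (a sketch; the value is the multinomial coefficient 52!/(13!)^4):

```python
# Number of ways to deal 52 cards to 4 players, 13 cards each:
# the multinomial coefficient 52! / (13!)^4.
from math import comb

deals = comb(52, 13) * comb(39, 13) * comb(26, 13)  # the last 13 cards are forced
print(f"{deals:.2e}")  # ~5.4e28 -- astronomically large, hence the need for approximation
```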

Mean-field approximation

Calculate a mean estimated observation state for each opponent agent. The estimated observation state of opponent i is a weighted sum over the possible observations x_t^i, weighted by their probabilities given the action, the history, and the game knowledge K (these partial probabilities become known during the game):

\hat{y}_t^i(a_t, H_t, K) = \sum_{x_t^i} x_t^i \, P(x_t^i \mid a_t, H_t, K)

The transition probability is then approximated as

P(x_{t+1} \mid a_t, H_t, K) \approx \sum_{(a_t^1, a_t^2, a_t^3) \in S_{t+1}} \prod_{i=1}^{3} P(a_t^i \mid \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i)

with \hat{y}_t^i the mean observation state. In this way the conditional probability of an action by opponent i can be determined, i.e., given its estimated "unobservable state".
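A minimal sketch of this mean observation estimate, assuming each candidate opponent observation is encoded as a 52-dimensional card-indicator vector (the encoding and function names are illustrative assumptions, not the paper's implementation):

```python
# Mean-field estimate: y_hat = sum_x x * P(x | a_t, H_t, K), i.e. the belief-weighted
# average of the candidate observations of one opponent.
import numpy as np

def estimate_mean_observation(candidate_observations: np.ndarray,
                              probabilities: np.ndarray) -> np.ndarray:
    """Weighted average of candidate observations; shapes (n, 52) and (n,) -> (52,)."""
    probabilities = probabilities / probabilities.sum()   # normalise the belief
    return probabilities @ candidate_observations

# Example: two equally likely hypotheses about which cards opponent i holds.
x1 = np.zeros(52); x1[[0, 5, 17]] = 1.0
x2 = np.zeros(52); x2[[0, 5, 30]] = 1.0
y_hat = estimate_mean_observation(np.stack([x1, x2]), np.array([0.5, 0.5]))
```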

Action control: TD Reinforcement Learning

An action is selected based on the expected TD error

\rho(a_t) = R(x_t, a_t) + \overline{V}(a_t) - V(x_t)

where the expected value of the next observation state is

\overline{V}(a_t) = \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t, H_t, K)\, V(x_{t+1}).

Using the expected TD error, the action selection probability is given by the Boltzmann (softmax) rule

P(a_t^k \mid x_t) = \frac{\exp(\rho(a_t^k)/T)}{\sum_{a_t^m \in A} \exp(\rho(a_t^m)/T)}

with T a temperature parameter.
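A minimal sketch of this Boltzmann selection rule (the expected TD errors of the legal actions are assumed to be given; the function name is illustrative):

```python
# Boltzmann (softmax) action selection over expected TD errors rho, temperature T.
import numpy as np

def boltzmann_select(rho: np.ndarray, T: float, rng: np.random.Generator) -> int:
    """Sample an action index with probability proportional to exp(rho/T)."""
    logits = rho / T
    logits = logits - logits.max()      # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(rho), p=p))

# Example: three legal cards with expected TD errors 0.2, -0.1, 0.5.
action = boltzmann_select(np.array([0.2, -0.1, 0.5]), T=0.3,
                          rng=np.random.default_rng(0))
```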

Action prediction

In the approximated transition, the probability of an opponent action is modeled with a softmax over a utility function U^i:

P(a_t^i \mid \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i) = \frac{\exp\big(U^i(\hat{y}_t^i(a_t, H_t, K), a_t^i)/T^i\big)}{\sum_{a_t^i \in A^i} \exp\big(U^i(\hat{y}_t^i(a_t, H_t, K), a_t^i)/T^i\big)}

We use a function approximator (an NGnet, a normalized Gaussian network) for the utility function U^i, which is likely to be non-linear. The function approximators can be trained using past games.
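A minimal sketch of a normalized-Gaussian-network regressor of the kind referred to above, assuming fixed isotropic Gaussian units gating local linear output models; the centers, widths, and weights here are illustrative placeholders (the paper trains its approximators on past games):

```python
# Sketch of an NGnet-style regressor: normalized Gaussian activations gate
# local linear models, giving a smooth non-linear utility estimate U(x).
import numpy as np

class NGnet:
    def __init__(self, centers: np.ndarray, sigma: float, W: np.ndarray, b: np.ndarray):
        self.centers, self.sigma = centers, sigma   # (M, D) unit centers, shared width
        self.W, self.b = W, b                       # (M, D) and (M,) local linear models

    def __call__(self, x: np.ndarray) -> float:
        d2 = ((x - self.centers) ** 2).sum(axis=1)  # squared distance to each unit
        g = np.exp(-d2 / (2 * self.sigma ** 2))
        g = g / g.sum()                             # normalized Gaussian activations
        local = self.W @ x + self.b                 # (M,) local linear predictions
        return float(g @ local)

# Example with 3 units on a 52-dimensional estimated observation vector.
rng = np.random.default_rng(0)
net = NGnet(centers=rng.normal(size=(3, 52)), sigma=1.0,
            W=rng.normal(size=(3, 52)), b=np.zeros(3))
u = net(rng.normal(size=52))
```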

Summary of proposed method

An RL scheme based on:
- estimation of the unobservable state variables by mean-field approximation
- prediction of the opponent agents' actions, P(a_t^i \mid \hat{y}_t^i(a_t, H_t, K), \hat{\Phi}^i)

The learning agent determines its action based on this prediction of the environmental behavior.

Computer simulations

Three experiments evaluate the learning agent against a rule-based agent:
- Single-agent learning in a stationary environment:
  (A) one learning agent vs. three rule-based agents
- Learning by multiple agents in a multi-agent environment:
  (B) one learning agent, one actor-critic agent, and two rule-based agents
  (C) two learning agents vs. two rule-based agents

A rule-based agent has more than 50 rules and plays at an "experienced" Hearts level.

[Figure: experiment (A) — average penalty ratio vs. number of games; proposed RL agent vs. three rule-based agents (lower ratio = better player).]

[Figure: experiment (B) — average penalty ratio vs. number of games; proposed RL agent, actor-critic agent, and two rule-based agents.]

[Figure: experiment (C) — average penalty ratio vs. number of games; two proposed RL agents vs. two rule-based agents.]

Summary

- We proposed an RL scheme for making an autonomous learning agent that plays the multi-player card game "Hearts".
- Our RL agent estimates the unobservable state variables using a mean-field approximation, and learns to predict the environmental behavior.
- Computer simulations showed that our method is applicable to a realistic multi-agent problem.

Nara Institute of Science and Technology (NAIST)
Hajime Fujita
hajime-f@is.aist-nara.ac.jp
http://hawaii.aist-nara.ac.jp/~hajime-f/