Grenoble

Monte-Carlo Tree Search

Games with partial observation

[email protected] + David Auger + Hervé Fournier + Fabien Teytaud + Sébastien Flory + JY Audibert + S. Bubeck + R. Munos + ...
Includes Inria, Cnrs, Univ. Paris-Sud, LRI, CMAP, Taiwan universities, Lille, Paris, Boostr...

TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence.

Grenoble, June 2011


1. Games (a bit of formalism)

2. Hidden information SA

3. Decidability / complexity

4. Real implementation ==> application to Urban Rivals

A game is a directed graph

A game is a directed graph with actions

[Figure: directed graph whose edges carry action labels 1, 2, 3]

and players

[Figure: the graph with action labels and, on each node, the player to move (Black or White)]

and players and observations

[Figure: the graph with actions and players, plus the observation attached to each node (Bee, Bob, Bee, Bear)]

and players and observations and rewards

[Figure: the graph with actions, players and observations, plus rewards (+1 or 0)]

Rewards on leaves only!

A game is a directed graph + actions + players + observations + rewards + loops

[Figure: the full game graph with actions, players, observations, rewards (+1, 0) and loops]
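The definition built up over the previous slides can be written down as a small data structure. This is only a sketch: the node names, the `Node` class and the tiny example graph are hypothetical, chosen to show actions, players, observations, leaf rewards and one loop.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    player: str                       # who moves here: "Black" or "White"
    observation: str = ""             # what is shown to the players at this node
    reward: Optional[float] = None    # set on leaves only
    edges: Dict[int, str] = field(default_factory=dict)  # action -> successor name

def is_leaf(node: Node) -> bool:
    return not node.edges

# Hypothetical example graph (not the one drawn on the slide).
game = {
    "root": Node("Black", "Bee", edges={1: "a", 2: "b"}),
    "a": Node("White", "Bob", edges={1: "win", 2: "root"}),  # action 2 loops back
    "b": Node("White", "Bear", edges={1: "lose"}),
    "win": Node("Black", reward=1.0),
    "lose": Node("Black", reward=0.0),
}
```

Storing successors by name rather than by reference keeps loops easy to express.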





4. Real implementation



Consider games as follows:

Turn 1

Turn 2

Turn K: all information is revealed.

Turn K+1

Turn K+2

Turn 2K: all information is revealed

Turn NK: all information is revealed



Rewrite it as follows:

Turn 1: player 1 chooses (privately) his strategy until turn K


Intermediate turns removed!


Turn K+1

Turn K+2





Rewrite it as follows:



Intermediate turns removed!


Turn K+1

Turn K+2



Equivalent to simultaneous actions



Now it's a game with simultaneous actions and no hidden information.

Simultaneous actions = (almost) short-term hidden information.
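A minimal sketch of this rewriting, under a simplifying assumption that is not in the slides: neither player observes anything before the reveal at turn K, so a private strategy is just a sequence of K actions. Each block of K turns then becomes one simultaneous choice of a plan, that is, a matrix game. The function names and the toy payoff are hypothetical.

```python
from itertools import product

def to_matrix_game(actions, k, payoff):
    """Rows: player 1's private plans for K turns; columns: player 2's plans."""
    plans = list(product(actions, repeat=k))
    matrix = [[payoff(p1, p2) for p2 in plans] for p1 in plans]
    return plans, matrix

def guess_payoff(p1_plan, p2_plan):
    # Hypothetical payoff: player 1 wins (1) only if both plans coincide.
    return 1 if p1_plan == p2_plan else 0

plans, matrix = to_matrix_game(actions=(0, 1), k=2, payoff=guess_payoff)
```

The number of plans grows as |actions|^K, which is why the transformation is only practical when information is revealed often.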



2. Hidden information but no sure win.

==> the UD question is not relevant here!

1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1

Complexity question for phantom-games ?

This is phantom-go.

Good for black: wins with proba 1-1/(8!)
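For scale, the probability quoted above can be evaluated directly. The slide does not say where the 8! comes from, so this only computes the number.

```python
import math

# Probability quoted on the slide: 1 - 1/(8!)
orderings = math.factorial(8)   # 40320
p_win = 1 - 1 / orderings
print(f"8! = {orderings}, p_win = {p_win:.7f}")
```

So the position is won except for about one case in forty thousand, which is "almost sure" but still not a sure win.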

Here, there's no move which ensures a win.

But some moves are much better than others!

Joint work with F. Teytaud

It becomes complicated

Isn't it possible to consider a better question ?

Complexity (2P, no random)
X = proba(winning) that we look for

                                 Unbounded    Exponential    Polynomial
                                 horizon      horizon        horizon

Full observability               EXP          EXP            PSPACE

No obs (X=100%)                  EXPSPACE     NEXP
(Haslum et al, 2000)

Partially observable (X=100%)    2EXP         EXPSPACE
                                 (Rintanen)   (Mundhenk)

Simult. actions                  ?            EXPSPACE       ?

P2 can check the consistency of one 3-tuple per line

==> requires space log(N) ( = position of the 3-tuple)

EXPSPACE-complete PO games

The one-player PO case is EXPSPACE-complete (games in succinct form).

2EXPTIME-complete PO games

The two-player PO case is 2EXP-complete (games in succinct form).

Undecidable games (B. Hearn)

The three-player PO case is undecidable. (two players against one, not allowed to communicate)

Complexity (2P, no random)

                                 Unbounded       Exponential    Polynomial
                                 horizon         horizon        horizon

Full observability               EXP             EXP            PSPACE

No obs (X=100%)                  EXPSPACE        NEXP
(Haslum et al, 2000)

Partially observable (X=100%)    2EXP            EXPSPACE
                                 (Rintanen 97)

Simult. actions                  ?               EXPSPACE       ?

0.6 or not just a subtle precision trouble.





4. Real implementation

Real implementation for simultaneous action ?

MCTS principle

But with EXP3 in nodes.

Coulom (06)

Chaslot, Saito & Bouzy (06)

Kocsis & Szepesvari (06)

UCT (Upper Confidence Trees)


Kocsis & Szepesvari (06)

Exploitation ...

SCORE = 5/7 + k.sqrt( log(10)/7 )

... or exploration ?

SCORE = 0/2 + k.sqrt( log(10)/2 )
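The two scores above can be compared numerically. This sketch assumes a natural logarithm and reads the 10 as the total number of simulations through the parent node; the function name `ucb_score` is hypothetical.

```python
import math

def ucb_score(wins, visits, total_visits, k=1.0):
    """Mean reward plus exploration bonus: wins/visits + k*sqrt(log(total)/visits)."""
    return wins / visits + k * math.sqrt(math.log(total_visits) / visits)

for k in (1.0, 2.0):
    exploit = ucb_score(5, 7, 10, k)   # move with 5 wins out of 7
    explore = ucb_score(0, 2, 10, k)   # move with 0 wins out of 2
    choice = "exploitation" if exploit > explore else "exploration"
    print(f"k={k}: {exploit:.3f} vs {explore:.3f} -> {choice}")
```

With these counts the well-tested move is preferred for k = 1 and the rarely tried move for k = 2, so the constant k directly sets the balance.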

Replace it with EXP3 / INF
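For reference, here is a sketch of the textbook form of EXP3 for rewards in [0, 1]. The variant actually used inside the tree nodes is not specified in the slides and may differ; the class name, `gamma` value and demo loop are assumptions.

```python
import math
import random

class Exp3:
    def __init__(self, n_arms, gamma=0.1, rng=None):
        self.n = n_arms
        self.gamma = gamma                  # exploration rate in (0, 1]
        self.weights = [1.0] * n_arms
        self.rng = rng or random.Random()

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n
                for w in self.weights]

    def draw(self):
        # Sample an arm from the current mixed strategy.
        return self.rng.choices(range(self.n), weights=self.probabilities())[0]

    def update(self, arm, reward):
        # Importance-weighted estimate: only the played arm is updated.
        estimate = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n)

# Demo: arm 0 always pays 1, arm 1 always pays 0.
bandit = Exp3(2, gamma=0.1, rng=random.Random(0))
for _ in range(200):
    arm = bandit.draw()
    bandit.update(arm, 1.0 if arm == 0 else 0.0)
```

Unlike UCT, the output is a probability distribution over moves, which is what a simultaneous-action node needs. Over very long runs the weights should be renormalized to avoid overflow.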

The game of Go is a part of AI.

Computers are ridiculous in front of children.

Easy situation.
Termed semeai.
Requires a little bit of abstraction.


800 cores, 4.7 GHz, top level program.

Plays a stupid move.


8 years old; little training; finds the right move

MoGo(TW): games vs pros in the game of Go

First win in 9x9

First draw (a few days ago!) over 6 games

First win over 4 games in 9x9 blind Go

First win with H2.5 in 13x13 Go

First win with H6 in 19x19 Go in 2009 (also done by Zen)

First win with H7 in 19x19 Go vs top pro in 2009 (also done by Pachi in 2011)





4. Real implementation ==> Dark Chess endgames ==> application to Urban Rivals

Let's have fun with Urban Rivals (4 cards)

Each player has:
- four cards (each one can be used once)
- 12 pilz (each one can be used once)
- 12 life points

Each card has:
- one attack level
- one damage
- special effects (forget them for the moment)

Four turns:
- P1 attacks P2
- P2 attacks P1
- P1 attacks P2
- P2 attacks P1

First, attacker plays:
- chooses a card
- chooses (PRIVATELY) a number of pilz
Attack level = attack(card) x (1 + nb of pilz)

Then, defender plays:
- chooses a card
- chooses a number of pilz
Defense level = attack(card) x (1 + nb of pilz)

Result:
If attack > defense
  Defender loses Power(attacker's card)
Else
  Attacker loses Power(defender's card)
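The resolution of one turn, as stated above, fits in a few lines. This is a sketch of the simplified rules only (no special effects); the `Card` class and the card values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    attack: int   # attack level of the card
    power: int    # life points removed from the loser of the turn

def level(card, pilz):
    """Attack or defense level = attack(card) x (1 + nb of pilz)."""
    return card.attack * (1 + pilz)

def resolve(attacker_card, attacker_pilz, defender_card, defender_pilz):
    """Return (life lost by attacker, life lost by defender)."""
    if level(attacker_card, attacker_pilz) > level(defender_card, defender_pilz):
        return 0, attacker_card.power
    return defender_card.power, 0   # ties go to the defender ("Else" branch)

print(resolve(Card(6, 3), 2, Card(7, 4), 1))   # 18 vs 14
print(resolve(Card(6, 3), 0, Card(6, 2), 0))   # 6 vs 6, a tie
```

Because the number of pilz is chosen privately, both players in effect pick their pilz simultaneously, which is what makes this a simultaneous-action game.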

==> The MCTS-based AI is now at the best human level.

Experimental (only) remarks on EXP3:

- discard strategies with small number of sims = better approx of the Nash

- also an improvement by taking into account the other bandit

- not yet compared to INF

- virtual simulations (inspired by Kummer)
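The first remark (discarding strategies with few simulations) could look like the sketch below. The slides give no exact rule, so the threshold `min_fraction` and the function name are assumptions made for illustration.

```python
def prune_and_normalize(counts, min_fraction=0.05):
    """Turn simulation counts into a mixed strategy, dropping rarely played arms.

    Arms with fewer than min_fraction of all simulations get probability 0;
    the remaining counts are renormalized to sum to 1.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no simulations recorded")
    kept = [c if c >= min_fraction * total else 0 for c in counts]
    if sum(kept) == 0:          # threshold too strict: fall back to raw counts
        kept = list(counts)
    kept_total = sum(kept)
    return [c / kept_total for c in kept]

strategy = prune_and_normalize([90, 8, 2], min_fraction=0.05)
```

The idea is that arms played very rarely mostly reflect forced exploration rather than the equilibrium strategy.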

Conclusions

New stuff:

- Undecidability of optimal play for 2-player games with hidden information

- Transformation: PO periodically revealed ==> simultaneous action game with full observation

Open problems:

- Complexity: simultaneous action and infinite horizon (in progress)

- Complexity with PO: same information for both players ?

- Nash of matrix games with strong dominance

- Mathematical validation of variants of Exp3 / Inf

- Consistent realistic approaches for PO games (H finite)


When is MCTS relevant ?

Robust in front of:

- High dimension;

- Non-convexity of Bellman values;

- Complex models

- Delayed reward

- Simultaneous actions

More difficult for:

- High values of H;

- Highly unobservable cases (Monte-Carlo, but not Monte-Carlo Tree Search, see Cazenave et al.)

- Lack of reasonable baseline for the MC



Further work !

Some undecidability results

We should test INF and justify mathematically our improvements


How to apply it:

- Implement the transition (a function action x state -> state)

- Design a Monte-Carlo part (a random simulation) (a heuristic in one-player games; difficult if two opponents)

==> at this point you can simulate...

- Implement UCT (just a bias in the simulator, no real optimizer)

- Possibly parallelize (Gelly et al)

Convenient. Easy to check.
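The recipe above can be followed on a toy game. This is a minimal sketch, not the implementation used in MoGo or for Urban Rivals: the game (take 1 to 3 stones, whoever takes the last stone wins), all function names and the parameter values are assumptions. Statistics are keyed by state, so identical positions share their counts.

```python
import math
import random

# Toy game (an assumption): a state is (stones, player_to_move).
def legal_actions(state):
    return [a for a in (1, 2, 3) if a <= state[0]]

def transition(action, state):
    """The transition function: action x state -> state."""
    stones, player = state
    return (stones - action, 1 - player)

def is_terminal(state):
    return state[0] == 0

def winner(state):
    # At a terminal state the player who just moved took the last stone.
    return 1 - state[1]

def rollout(state, rng):
    """Monte-Carlo part: play uniformly random moves to the end."""
    while not is_terminal(state):
        state = transition(rng.choice(legal_actions(state)), state)
    return winner(state)

def uct_search(root, n_sims=500, k=1.0, rng=None):
    rng = rng or random.Random()
    visits = {}  # state -> number of simulations that chose an action there
    stats = {}   # (state, action) -> [count, wins for the player to move]

    def score(state, action):
        n, w = stats.get((state, action), (0, 0.0))
        if n == 0:
            return float("inf")  # try every action once
        return w / n + k * math.sqrt(math.log(visits[state]) / n)

    for _ in range(n_sims):
        state, path = root, []
        # UCT only biases the moves played inside the stored part of the game.
        while not is_terminal(state) and state in visits:
            action = max(legal_actions(state), key=lambda a: score(state, a))
            path.append((state, action))
            state = transition(action, state)
        visits.setdefault(state, 0)    # store one new state per simulation
        won_by = rollout(state, rng)   # finish with the random simulator
        for s, a in path:              # update statistics along the path
            visits[s] += 1
            entry = stats.setdefault((s, a), [0, 0.0])
            entry[0] += 1
            entry[1] += 1.0 if won_by == s[1] else 0.0

    # Recommend the most simulated action at the root.
    return max(legal_actions(root), key=lambda a: stats.get((root, a), [0, 0.0])[0])

best = uct_search((3, 0), n_sims=500, rng=random.Random(0))
```

With 3 stones left, taking all 3 wins at once, so that move collects almost all simulations. Only `legal_actions`, `transition`, `is_terminal` and `winner` are game-specific; the search loop itself does not change from one game to another.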

PO problems, approx. Nash ==> mailing list

Challenge: outperform humans in Urban Rivals
- free game
- fast games (~ 1 minute)
- 11M registered players
