Grenoble
Uploaded by olivier-teytaud
Monte-Carlo Tree Search
Games with partial observation
[email protected] + David Auger + Hervé Fournier + Fabien Teytaud + Sébastien Flory + JY Audibert + S. Bubeck + R. Munos + ...
Includes Inria, Cnrs, Univ. Paris-Sud, LRI, CMAP, Taiwan universities, Lille, Paris, Boostr...
TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence.
Grenoble, June 2011
Monte-Carlo Tree Search
1. Games (a bit of formalism)
2. Hidden information SA
3. Decidability / complexity
4. Real implementation ==> application to Urban Rivals
A game is a directed graph
A game is a directed graph with actions
and players and observations and rewards
[Figure: game graph. Edges are labeled with actions (1, 2, 3, 4); each node belongs to a player (Black or White) and carries an observation (Bee, Bob, Bear); leaves carry a reward (+1 or 0).]
Rewards on leaves only!
A game is a directed graph + actions
+ players + observations + rewards + loops
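The formalism above (graph, actions, players, observations, rewards, loops) can be sketched as a small data structure. This is only an illustration: the names Node, is_leaf and play, and the toy graph itself, are assumptions made for this sketch and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    player: Optional[str]           # "Black" or "White"; None on a leaf
    observation: str                # what is observed at this node
    reward: Optional[float] = None  # set on leaves only
    edges: Dict[int, str] = field(default_factory=dict)  # action -> next node name


def is_leaf(node: Node) -> bool:
    return not node.edges


# Hypothetical toy game. Nodes "a" and "b" share the observation "Bob",
# so White cannot tell them apart (hidden information).
game = {
    "root": Node("Black", "Bee", edges={1: "a", 2: "b"}),
    "a": Node("White", "Bob", edges={1: "win", 2: "root"}),  # action 2 is a loop
    "b": Node("White", "Bob", edges={1: "lose", 2: "win"}),
    "win": Node(None, "Bear", reward=1.0),
    "lose": Node(None, "Bear", reward=0.0),
}


def play(graph: Dict[str, Node], start: str, actions: List[int]) -> str:
    """Follow a sequence of actions from `start`; return the name of the final node."""
    name = start
    for action in actions:
        name = graph[name].edges[action]
    return name
```

For instance, play(game, "root", [1, 2, 2, 1]) goes through the loop once before reaching the leaf "lose".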
Consider games as follows:
Turn 1
Turn 2
Turn K: all information is revealed.
Turn K+1
Turn K+2
Turn 2K: all information is revealed
Turn NK: all information is revealed
Rewrite it as follows:
Turn 1: player 1 chooses (privately) his strategy until turn K
Turn 2: player 2 chooses (privately) his strategy until turn K
Intermediate turns removed!
Turn K: all information is revealed.
Turn K+1
Turn K+2
Turn 2K: all information is revealed
Turn NK: all information is revealed
Equivalent to simultaneous actions
Now it's a game with simultaneous actions and no hidden information.
Simultaneous actions = (almost) short-term hidden information.
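A minimal sketch of this rewriting, under a strong simplifying assumption: no observation at all is received inside a block of hidden turns, so a private strategy "until turn K" is just a fixed sequence of moves. The function block_to_matrix and the toy outcome function are hypothetical names introduced for illustration.

```python
from itertools import product


def block_to_matrix(moves_p1, moves_p2, turns_each, outcome):
    """Collapse one block of hidden turns into a simultaneous-action matrix game.

    Assumption: players observe nothing inside the block, so each private
    strategy is a fixed sequence of `turns_each` moves.
    `outcome(plan1, plan2)` gives the result for player 1 once all is revealed.
    """
    plans_p1 = list(product(moves_p1, repeat=turns_each))
    plans_p2 = list(product(moves_p2, repeat=turns_each))
    matrix = [[outcome(a, b) for b in plans_p2] for a in plans_p1]
    return plans_p1, plans_p2, matrix


def matches(plan1, plan2):
    """Toy outcome: player 1 scores 1 when both hidden sequences coincide."""
    return 1 if plan1 == plan2 else 0


plans_p1, plans_p2, matrix = block_to_matrix((0, 1), (0, 1), 2, matches)
```

Each row is one private strategy of player 1 and each column one private strategy of player 2; both are chosen without seeing the other, which is exactly a simultaneous action.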
Monte-Carlo Tree Search
1. Games (a bit of formalism)
2. Hidden information but no sure win.
==> the UD question is not relevant here!
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1
Complexity question for phantom-games ?
This is phantom-go.
Good for black: wins with proba 1-1/(8!)
Here, there's no move which ensures a win.
But some moves are much better than others!
Joint work with F. Teytaud
It becomes complicated
Isn't it possible to consider a better question?
Complexity (2P, no random)
X = proba(winning) that we look for
                               Unbounded    Exponential   Polynomial
                               horizon      horizon       horizon
Full Observability             EXP          EXP           PSPACE
No obs (X=100%)                EXPSPACE     NEXP                        (Haslum et al, 2000)
Partially Observable (X=100%)  2EXP         EXPSPACE
                               (Rintanen)   (Mundhenk)
Simult. Actions                ?            EXPSPACE      ?
P2 can check the consistency of one 3-tuple per line
==> requires space log(N) ( = position of the 3-tuple)
EXPSPACE-complete PO games
The one-player PO case is EXPSPACE-complete (games in succinct form).
2EXPTIME-complete PO games
The two-player PO case is 2EXP-complete (games in succinct form).
Undecidable games (B. Hearn)
The three-player PO case is undecidable (two players against one, not allowed to communicate).
Complexity (2P, no random)
                               Unbounded      Exponential   Polynomial
                               horizon        horizon       horizon
Full Observability             EXP            EXP           PSPACE
No obs (X=100%)                EXPSPACE       NEXP                      (Haslum et al, 2000)
Partially Observable (X=100%)  2EXP           EXPSPACE
                               (Rintanen 97)
Simult. Actions                ?              EXPSPACE      ?
0.6 or not just a subtle precision trouble.
Real implementation for simultaneous actions?
MCTS principle
But with EXP3 in nodes.
Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis & Szepesvari (06)
UCT (Upper Confidence Trees)
Kocsis & Szepesvari (06)
Exploitation ...
SCORE = 5/7 + k.sqrt( log(10)/7 )
... or exploration ?
SCORE = 0/2 + k.sqrt( log(10)/2 )
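The two scores above can be computed with a few lines. This is a sketch under assumptions: log is taken as the natural logarithm, 10 is read as the number of simulations of the parent node, and k is the exploration constant; the function name uct_score is hypothetical.

```python
import math


def uct_score(wins, visits, parent_visits, k=1.0):
    """Mean reward of the move plus an exploration bonus weighted by k."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)


# The two moves from the slides: 5 wins out of 7, and 0 wins out of 2.
well_known = uct_score(5, 7, 10, k=1.0)
rarely_tried = uct_score(0, 2, 10, k=1.0)
```

With a small k the move with 5/7 wins gets the higher score (exploitation); with a larger k the move tried only twice overtakes it (exploration).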
Replace it with EXP3 / INF
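A minimal sketch of one EXP3 bandit, as it could sit in a node (one bandit per player when actions are simultaneous). It follows the textbook EXP3 update with rewards in [0, 1]; the class name, the default gamma and the weight rescaling (added only to avoid overflow) are assumptions of this sketch, and INF is not shown.

```python
import math
import random


class Exp3:
    """Textbook EXP3 for rewards in [0, 1]; gamma is the exploration rate."""

    def __init__(self, n_arms, gamma=0.1, rng=None):
        self.n = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms
        self.rng = rng or random.Random()

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n
                for w in self.weights]

    def draw(self):
        """Sample an arm from the current mixed strategy."""
        return self.rng.choices(range(self.n), weights=self.probabilities())[0]

    def update(self, arm, reward):
        """Importance-weighted update of the arm that was actually played."""
        p = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * (reward / p) / self.n)
        # Rescaling leaves the probabilities unchanged and avoids overflow.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]
```

In a simultaneous-action node, each player would call draw() on its own bandit, the simulation would be run from the joint action, and each bandit would then be updated with that player's reward.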
The game of Go is a part of AI.
Computers are ridiculous in front of children.
Easy situation, termed semeai; requires a little bit of abstraction.
800 cores, 4.7 GHz, top level program: plays a stupid move.
8 years old, little training: finds the good move.
MoGo(TW): games vs pros
in the game of Go
First win in 9x9
First draw (a few days ago!) over 6 games
First win over 4 games in 9x9 blind Go
First win with H2.5 in 13x13 Go
First win with H6 in 19x19 Go in 2009 (also done by Zen)
First win with H7 in 19x19 Go vs top pro in 2009 (also done by Pachi in 2011)
Monte-Carlo Tree Search
1. Games (a bit of formalism)
2. Hidden information SA
3. Decidability / complexity
4. Real implementation ==> Dark Chess endgames ==> application to Urban Rivals
Let's have fun with Urban Rivals (4 cards)
Each player has:
- four cards (each one can be used once)
- 12 pilz (each one can be used once)
- 12 life points
Each card has:
- one attack level
- one damage
- special effects (forget it for the moment)
Four turns:
- P1 attacks P2
- P2 attacks P1
- P1 attacks P2
- P2 attacks P1
Let's have fun with Urban Rivals
First, attacker plays:
- chooses a card
- chooses (PRIVATELY) a number of pilz
Attack level = attack(card) x (1 + nb of pilz)
Then, defender plays:
- chooses a card
- chooses a number of pilz
Defense level = attack(card) x (1 + nb of pilz)
Result:
If attack > defense, defender loses Power(attacker's card)
Else attacker loses Power(defender's card)
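The rule for one turn, as stated above, can be sketched as follows. Special effects are ignored, as in the slides; the names Card, level and resolve_round are hypothetical, and Power is read as the damage of the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    attack: int
    power: int  # life points removed from the opponent when this card wins


def level(card, pilz):
    """Attack (or defense) level = attack(card) x (1 + nb of pilz)."""
    return card.attack * (1 + pilz)


def resolve_round(att_card, att_pilz, def_card, def_pilz):
    """Return (life lost by attacker, life lost by defender) for one turn."""
    if level(att_card, att_pilz) > level(def_card, def_pilz):
        return 0, att_card.power
    # Ties go to the defender, following "If attack > defense ... Else ...".
    return def_card.power, 0
```

For example, a card with attack 6 played with 2 pilz reaches level 18 and beats a card with attack 7 played with 1 pilz (level 14).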
Let's have fun with Urban Rivals
==> The MCTS-based AI is now at the best human level.
Experimental (only) remarks on EXP3:
- discard strategies with small number of sims = better approx of the Nash
- also an improvement by taking into account the other bandit
- not yet compared to INF
- virtual simulations (inspired by Kummer)
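The first remark (discarding strategies with a small number of simulations) can be illustrated by a simple truncation of the empirical distribution. The threshold rule below (a fixed share of the total number of simulations) is an assumption made for this sketch; the talk does not state which rule was used.

```python
def truncated_strategy(sim_counts, min_share=0.05):
    """Mixed strategy from simulation counts, dropping rarely simulated arms.

    Arms with fewer than `min_share` of all simulations get probability 0,
    and the remaining counts are renormalized.
    """
    total = sum(sim_counts)
    if total == 0:
        raise ValueError("no simulations")
    kept = [c if c >= min_share * total else 0 for c in sim_counts]
    kept_total = sum(kept)
    if kept_total == 0:  # threshold too high: fall back to the raw counts
        kept, kept_total = list(sim_counts), total
    return [c / kept_total for c in kept]


strategy = truncated_strategy([480, 500, 15, 5])
```

Here the two arms with 15 and 5 simulations are removed and the probability mass is shared between the first two arms.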
Conclusions
New stuff:
- Undecidability of optimal play for 2-player games with hidden information
- Transformation: PO periodically revealed ==> simultaneous action game with full observation
Open problems:
- Complexity: simultaneous action and infinite horizon (in progress)
- Complexity with PO: same information for both players?
- Nash of matrix games with strong dominance
- Mathematical validation of variants of Exp3 / Inf
- Consistent realistic approaches for PO games (H finite)
When is MCTS relevant?
Robust in front of:
- High dimension
- Non-convexity of Bellman values
- Complex models
- Delayed reward
- Simultaneous actions
More difficult for:
- High values of H
- Highly unobservable cases (Monte-Carlo, but not Monte-Carlo Tree Search, see Cazenave et al.)
- Lack of reasonable baseline for the MC
Further work!
Some undecidability results
We should test INF and justify mathematically our improvements
When is MCTS relevant ?
How to apply it:
Implement the transition
(a function action x state -> state)
Design a Monte-Carlo part (a random simulation)
(a heuristic in one-player games; difficult if two opponents)
==> at this point you can simulate...
Implement UCT (just a bias in the simulator, no real optimizer)
Possibly parallelize (Gelly et al)
Convenient. Easy to check.
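The recipe above (transition, random simulation, UCT) can be sketched on a tiny game. Everything here is an assumption chosen for illustration: the game (a single heap, remove 1 to 3 stones, taking the last stone wins), the uniform random playout, and all function names. It is sequential and fully observable, so it uses plain UCT rather than EXP3.

```python
import math
import random


def legal_actions(state):
    stones, _ = state
    return [a for a in (1, 2, 3) if a <= stones]


def transition(action, state):
    """The transition: a function action x state -> state."""
    stones, player = state
    return (stones - action, 1 - player)


def winner(state):
    """Taking the last stone wins, so the player who just moved has won."""
    stones, player = state
    return 1 - player if stones == 0 else None


def rollout(state, rng):
    """The Monte-Carlo part: a uniformly random simulation."""
    while winner(state) is None:
        state = transition(rng.choice(legal_actions(state)), state)
    return winner(state)


class Node:
    def __init__(self, state):
        self.state = state
        self.visits = 0
        self.wins = 0.0     # wins of the player who moved into this node
        self.children = {}  # action -> Node


def score(parent, child, k):
    return child.wins / child.visits + k * math.sqrt(math.log(parent.visits) / child.visits)


def uct_search(root_state, n_sims=2000, k=1.0, seed=0):
    """UCT: a bias in the simulator. Returns the most simulated root action."""
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_sims):
        node, path = root, [root]
        while winner(node.state) is None:
            untried = [a for a in legal_actions(node.state) if a not in node.children]
            if untried:  # expand one new child, then simulate from it
                action = rng.choice(untried)
                node.children[action] = Node(transition(action, node.state))
                node = node.children[action]
                path.append(node)
                break
            parent = node
            node = max(parent.children.values(), key=lambda c: score(parent, c, k))
            path.append(node)
        won_by = rollout(node.state, rng)
        for visited in path:
            visited.visits += 1
            if won_by == 1 - visited.state[1]:  # the player who moved into it won
                visited.wins += 1
    return max(root.children, key=lambda a: root.children[a].visits)
```

With 3 stones left, the search should concentrate its simulations on taking all 3 stones, which wins immediately.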
PO problems, approx. Nash ==> mailing list
Challenge: outperform humans in Urban Rivals
- free game
- fast games (~ 1 minute)
- 11M registered players