Sequential decision making:
decidability and complexity
Games with partial observation
[email protected] + too many people to all be cited. Includes Inria, Cnrs, Univ. Paris-Sud, LRI, Taiwan universities (including NUTN), CITINES project.
TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence.
Paris, September 2012.
A quite general model
A directed graph (finite). A starting point on the graph, a target (or several targets, with different rewards). I want to reach a target.
Labels (= decisions) on edges: next node = f(current node, decision)
Each node is either:
- a random node (a random decision is made)
- a decision node (I choose a decision)
- an opponent node (an opponent chooses)
Partial observation
Each decision node is equipped with an observation; you can make decisions using the list of past observations
==> you don't know where you are in the graph
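The model above can be sketched in a few lines of code (a minimal, illustrative sketch; all names are mine, not from the talk):

```python
import random

# Finite directed graph whose edges carry decision labels. Node kinds:
# "decision" (player 1 chooses), "opponent" (player 2 chooses), "random".
# Targets carry a nonzero reward.

class Node:
    def __init__(self, kind, edges, reward=0.0):
        self.kind = kind          # "decision", "opponent" or "random"
        self.edges = edges        # dict: label -> successor node name
        self.reward = reward      # nonzero at target nodes

def play(graph, start, p1_policy, p2_policy, horizon):
    """Play one episode; policies only see the list of past observations
    (here, the past edge labels), not the current node: partial observation."""
    node, observations = graph[start], []
    for _ in range(horizon):
        if node.reward:
            return node.reward
        labels = sorted(node.edges)
        if node.kind == "random":
            label = random.choice(labels)
        elif node.kind == "decision":
            label = p1_policy(observations, labels)
        else:
            label = p2_policy(observations, labels)
        observations.append(label)
        node = graph[node.edges[label]]
    return 0.0
```

A policy is a function of the observation history only, which is exactly what "you don't know where you are in the graph" means.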
Overview
10%: overview of Alternating Turing machine & computational complexity (great tool for complexity upper bounds)
50%: general culture on games (including undecidability)
35%: general culture on fictitious play (matrix games) (probably no time for this...)
4%: my results on that stuff ==> 2 detailed proofs (one new) ==> feel free to interrupt
Outline
Complexity and ATM
Complexity and games (incl. planning)
Bounded horizon games
Classical complexity classes
P NP PSPACE EXPTIME NEXPTIME EXPSPACE
Proved: PSPACE ⊊ EXPSPACE, P ⊊ EXPTIME, NP ⊊ NEXPTIME
Believed, not proved: P ≠ NP, EXPTIME ≠ NEXPTIME, NEXPTIME ≠ EXPSPACE
Complexity and alternating Turing machines
Turing machine (TM) = abstract computer
Non-deterministic Turing Machine (NTM) = TM with exists states (i.e. several transitions, accepts if at least one transition accepts)
Co-NTM: TM with for all states (i.e. several transitions, accepts if all transitions lead to accept)
ATM: TM with both exists and for all states.
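The exists/for all acceptance rule is exactly a two-player game on configurations: an "exists" player trying to reach an accepting configuration against a "for all" player. A minimal recursive sketch (illustrative; all names are mine):

```python
# "exists" configurations accept if SOME successor accepts (prover's move);
# "forall" configurations accept if ALL successors accept (refuter's move).
# Configurations are modeled abstractly via two callbacks.

def accepts(config, kind, successors):
    """kind(c) in {"exists", "forall", "accept", "reject"};
    successors(c) lists the next configurations."""
    k = kind(config)
    if k == "accept":
        return True
    if k == "reject":
        return False
    results = [accepts(c, kind, successors) for c in successors(config)]
    return any(results) if k == "exists" else all(results)
```

With only "exists" internal nodes this is NTM acceptance; with only "forall" nodes, co-NTM acceptance.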
Alternation
Non-determinism & alternation
Outline
Complexity and ATM
Complexity and games (incl. planning)
Bounded horizon games
Computational complexity: framework
Uncertainty can be:
- Adversarial: I focus on the worst case
- Stochastic: I focus on the average result
- Or both.
Stochastic = adversarial if the goal is 100% success. Stochastic != adversarial in the general case.
Computational complexity: framework
Many representations for problems. E.g.:
- Succinct: a circuit computes the i-th bit of the proba that action a leads to a transition from s to s'
- Compressed: a circuit computes many bits simultaneously
- Flat: longer encoding (transition tables)
==> does not matter for decidability
==> matters for complexity
Computational complexity: framework
Many representations for problems. E.g.:
- Succinct
- Compressed
- Flat
Compressed representation is somehow natural (the state space has exponential size, transitions are fast): see e.g. Mundhenk for detailed definitions and flat representations.
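The gap between the representations can be made concrete (an illustrative sketch; the transition rule is an arbitrary example, not from the talk):

```python
# The same deterministic transition on 2^n states, given either succinctly
# (a small function playing the role of a circuit) or flat (an explicit
# table with 2^n entries). The input size differs exponentially, hence the
# different complexity results for the same decision problem.

n = 16
N = 1 << n                     # 2^16 = 65536 states

def succinct_next(s):
    """Tiny 'circuit': successor of state s, computed in O(n) operations."""
    return (s + 1) % N

flat_table = [succinct_next(s) for s in range(N)]   # size exponential in n
```

An EXP result for flat inputs and a 2EXP result for succinct inputs can thus describe the same amount of absolute work, measured against inputs of very different sizes.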
Computational complexity: framework
We use mainly compressed representation; see also Mundhenk for flat representations.
Typically, exponentially smaller representations lead to exponentially higher complexity
==> but it's not always the case...
Simple changes can alter the complexity a lot: with superko (rules forbid repeating a position), some fully observable 2-player games become EXPSPACE instead of EXP ==> discussed later
Computational complexity: framework for first tables of results
Either search (find a target) or optimize (cumulate rewards over time)
Compressed (written with circuits or otherwise) or not (flat).
Horizon:
- Short horizon: horizon = size of input
- Long horizon: log2(horizon) = size of input
- Infinite horizon: no limit
Mundhenk's summary: one player, limited horizon: expected reward >0 ?
Mundhenk's summary: one player, non-negative reward, looking for non-neg. average reward (= positive proba of reaching): easier
Complexity, partial observation, infinite horizon, proba of reaching a target
1P+random, unobservable: undecidable (Madani et al)
1P+random, P(win)=1, or equivalently 2P, P(win)=1 [Rintanen and refs therein]:
- Fully observable: EXP [Littman 1994]
- Unobservable: EXPSPACE [Haslum et al 2000]
- Partially observable: 2EXP [Rintanen, 2003]
Rmk: 2P with the P(win)=1 criterion is not really 2P!
Complexity, partial observation, infinite horizon
2P vs 1P (two players against one), P(win)=1?: undecidable! [Hearn, Demaine]
2P (random or not): existence of a sure win is equivalent to 1P+random!
- EXP, fully observable (e.g. Go, Robson 1984)
- PSPACE, unobservable
- 2EXP, partially observable
Existence of a sure win with the same state forbidden twice: EXPSPACE-complete (Go with Chinese rules? rather conjectured EXPTIME or PSPACE...)
General case (optimal play): undecidable (Auger, Teytaud) (what about phantom-Go?)
Complexity, partial observation
Remarks:Continuous case ?
Purely epistemic (we gather information, we don't change the state) ? [Sabbadin et al]
Restrictions on the policy, on the set of actions...
Discounted reward
DEC-POMDP, POSG : many players, same/opposite/different reward functions...
What are the approaches ?
Dynamic programming (Massé, Bellman, 1950s) (still the main approach in industry), alpha-beta, retrograde analysis
Reinforcement learning
MCTS (R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In Proceedings of the 5th International Conference on Computers and Games, Turin, Italy, 2006)
Scripts + Tuning / Direct Policy Search
Coevolution
All have their PO extensions, but the last two are the most convenient in this case.
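The first approach on the list, dynamic programming, can be sketched for the 1-player stochastic case on a flat graph (a minimal value-iteration sketch; names and data layout are illustrative):

```python
# Bellman-style value iteration: V[s] = max over actions of the expected
# value of the successor; with reward 1 at the target, V[s] is the maximal
# probability of reaching the target from s.

def value_iteration(states, trans, target, iters=100):
    """trans[s] = {action: [(prob, next_state), ...]} for non-target states."""
    V = {s: (1.0 if s == target else 0.0) for s in states}
    for _ in range(iters):
        for s in states:
            if s == target:
                continue
            V[s] = max(
                (sum(p * V[s2] for p, s2 in outcomes)
                 for outcomes in trans.get(s, {}).values()),
                default=0.0,
            )
    return V
```

This is exactly the part that breaks under partial observation: V is indexed by the state, which the player cannot see; PO extensions must work on belief states instead.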
Partially observable games
Many tools for fully observable games. Not so many for partially observable ones.
Shi-Fu-Mi (Rock-Paper-Scissors)
Card games
Phantom games
Shi-Fu-Mi (Rock-Paper-Scissors)
Fully observable in simultaneous play, but partially observable in turn-based version.
Computers are stronger than humans (yes, it's true).
Card games, phantom games
Phantomized version of a game: you don't see the moves of your opponents.
If you play an illegal move, you are informed that it's illegal and you play again.
Usually, you get a bit more information (captures, threats...). Dark Chess: more info.
phantom-Go
etc.
Partially observable games
Usually quite heuristic algorithms
Best performing algorithms combine:
- Opponent modelling (as for Shi-Fu-Mi)
- Belief state (often by Monte-Carlo simulations)
- Not a lot of tree search
- A lot of tuning ==> usually no consistency analysis
Part I: Complexity analysis (unbounded horizon)
Game:
- One or two players
- Win, loss, draw (incl. endless loop)
- Partial observability, no random part
- Finite state space: state = transition(state, action)
- action decided by each player in turn
State of the art
- makes sense in fully observable games
- not so much in non-observable games
State of the art
EXPTIME-complete in the general fully-observable case
EXPTIME-complete fully observable games
- Chess (for some nxn generalization)
- Go (with no superko)
- Draughts (international or English)
- Chinese checkers
- Shogi
PSPACE-complete fully observable games
- Amazons
- Hex
- Go-moku
- Connect-6
- Qubic
- Reversi
- Tic-Tac-Toe
Many games where each cell is filled once and only once: polynomial horizon + full observation ==> PSPACE
EXPSPACE-complete unobservable games (Haslum & Jonsson)
The two-player unobservable case is EXPSPACE-complete (games in succinct form, infinite horizon).
(still for the 100%-win UD criterion; for not fully observable cases it is necessary to be precise...)
Importantly, the UD criterion means that strategies are the same whether the opponent has full observation or no observation ==> UD is very bad :-(
EXPSPACE-complete unobservable games (Haslum & Jonsson)
The two-player unobservable case is EXPSPACE-complete (games in succinct form).
PROOF:
(I) First note that strategies are just sequences of actions (no observability!)
(II) It is in EXPSPACE = NEXPSPACE, because of the following algorithm:
(a) Non-deterministically choose the sequence of actions (an exponential list of actions is enough...)
(b) Check the result against all possible opponent strategies
(III) We have to check the hardness only.
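Step (II) can be made concrete for tiny flat games by replacing the nondeterministic guess with plain enumeration (an illustrative sketch, exponentially slower than the real algorithm; all names are mine):

```python
from itertools import product

# With no observations, a strategy really is a plain action sequence, so a
# sure win means: SOME P1 sequence wins against EVERY P2 sequence.

def sure_win(step, start, win, horizon, p1_actions, p2_actions):
    """step(state, a1, a2) -> next state; win(state) -> bool."""
    def run(seq1, seq2):
        s = start
        for a1, a2 in zip(seq1, seq2):
            s = step(s, a1, a2)
        return s
    return any(
        all(win(run(seq1, seq2)) for seq2 in product(p2_actions, repeat=horizon))
        for seq1 in product(p1_actions, repeat=horizon)
    )
```

The NEXPSPACE algorithm avoids the outer enumeration by guessing `seq1`, and runs the inner check reusing exponential space.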
EXPSPACE-complete unobservable games (Haslum & Jonsson)
The two-player unobservable case is EXPSPACE-complete (games in succinct form).
PROOF of the hardness: reduction from: is my TM with exponential tape going to halt?
Consider a TM with a tape of size N = 2^n.
We must find a game
- with size n (n = log2(N))
- such that player 1 has a winning strategy iff the TM halts.
EXPSPACE-complete unobservable games (Haslum & Jonsson)
Encoding a Turing machine with a tape of size N as a game with state O(log(N)).
Player 1 chooses the sequence of configurations of the tape (N=4):
x(0,1), x(0,2), x(0,3), x(0,4)  ==> initial state
x(1,1), x(1,2), x(1,3), x(1,4)
x(2,1), x(2,2), x(2,3), x(2,4)
x(3,1), x(3,2), x(3,3), x(3,4)
.....................................
x(N,1), x(N,2), x(N,3), x(N,4)
Wins by final state!
Except if P2 finds an illegal transition!
==> P2 can check the consistency of one 3-uple per line
==> requests space log(N) (= position of the 3-uple)
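P2's check is cheap because computation tableaux are local: cell (t+1, i) is determined by the 3-window (t, i-1), (t, i), (t, i+1), so naming one position i costs only log(N) bits. A concrete sketch (the local rule below is an arbitrary stand-in, not a real machine's transition table):

```python
# Check one announced cell of row t+1 against the 3-window above it.

def window_consistent(tableau, t, i, local_rule):
    """local_rule maps a 3-window of row t to the cell below its center."""
    w = (tableau[t][i - 1], tableau[t][i], tableau[t][i + 1])
    return tableau[t + 1][i] == local_rule(w)
```

If P1 ever announces a row that does not follow from the previous one, some single index i witnesses it, and P2 wins by pointing at that i.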
EXPSPACE-complete unobservable games
The 1P + unknown initial state, unobservable case is EXPSPACE-complete (games in succinct form).
2P+unobservable as well.
2EXPTIME-complete PO games
The two-player PO case, or 1P+random PO, is 2EXP-complete (games in succinct form).
(2P = 1P+random because of UD)
Undecidable games (B. Hearn)
The three-player PO case is undecidable (two players against one, not allowed to communicate).
Hummm ?
Do you know a PO game in which you can ensure a win with probability 1 ?
Another formalization
==> much more satisfactory (might have drawbacks as well...)
Madani et al.
1 player + random = undecidable (even without opponent!)
Madani et al.
1 player + random = undecidable.
==> answers a (related) question by Papadimitriou and Tsitsiklis.
Proof ?
Based on the emptiness problem for probabilistic finite automata (see Paz 1971):
Given a probabilistic finite automaton, is there a word accepted with proba at least c?
==> undecidable
Consequence for unobservable games
1 player + random = undecidable==> 2 players = undecidable.
Proof: undecidability with 1 player against random ==> undecidability with 2 players
How to simulate 1 player + random with 2 players ?
A random node to be rewritten
Rewritten as follows:
- Player 1 chooses a in [[0,N-1]]
- Player 2 chooses b in [[0,N-1]]
- c = (a+b) modulo N
- Go to t_c
Each player can force the game to be equivalent to the initial one (by playing uniformly)
==> the proba of winning for player 1 (in case of perfect play) is the same as for the initial game
==> undecidability!
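The key fact behind the rewriting can be checked exactly: if either player mixes uniformly over [[0, N-1]], then c = (a+b) mod N is uniform, whatever the other player does. A small exact-arithmetic sketch (names are mine):

```python
from fractions import Fraction

# Distribution of c = (a+b) mod N given the two players' mixed strategies.

def dist_of_c(N, p_a, p_b):
    """p_a, p_b: probability vectors over [[0, N-1]]."""
    p_c = [Fraction(0)] * N
    for a in range(N):
        for b in range(N):
            p_c[(a + b) % N] += p_a[a] * p_b[b]
    return p_c
```

So each player can unilaterally force the rewritten node to behave like the original random node, which is why the value of the game is preserved.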
Important remark
Existence of a strategy winning with proba 0.5 is also undecidable for the restriction to games in which the proba is > 0.6 or < 0.4 ==> not just a subtle precision trouble.
So what ?
We have seen that unbounded horizon + partial observability + natural criterion (not sure win)
==> undecidability,
contrary to what is expected from the usual definitions.
What about bounded horizon, 2P? Clearly decidable.
Complexity ?
Algorithms ? (==> coevolution & LP)
Complexity (2P, 0-sum, no random)
                         Unbounded        Exponential     Polynomial
                         horizon          horizon         horizon
Full observability       EXP              EXP             PSPACE
No obs (X=100%)          EXPSPACE         NEXP
                         (Haslum et al, 2000)
Partially observable     2EXP             EXPSPACE
(X=100%)                 (Rintanen)       (Mundhenk)
Simult. actions          ?                EXPSPACE        ?
(converges to Nash: Robinson 51)
Fictitious play
TODO
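Fictitious play for a zero-sum matrix game can be sketched in a few lines (a minimal, illustrative implementation; payoffs are to the row player; Robinson (1951) guarantees the empirical mixed strategies converge to optimal play):

```python
# Each player best-responds to the OPPONENT's empirical frequencies of play.

def fictitious_play(M, iters=2000):
    rows, cols = len(M), len(M[0])
    row_counts, col_counts = [0] * rows, [0] * cols
    i, j = 0, 0
    for _ in range(iters):
        row_counts[i] += 1
        col_counts[j] += 1
        # Row player maximizes against the column player's empirical mixture...
        i = max(range(rows),
                key=lambda r: sum(col_counts[c] * M[r][c] for c in range(cols)))
        # ...column player minimizes against the row player's empirical mixture.
        j = min(range(cols),
                key=lambda c: sum(row_counts[r] * M[r][c] for r in range(rows)))
    return ([n / iters for n in row_counts], [n / iters for n in col_counts])
```

On matching pennies the empirical frequencies drift toward (1/2, 1/2), though convergence of FP is famously slow.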
Improvements for KxK matrix games: approximations and sparse exact solutions
There exist ε-approximations with support of size O(log(K)/ε²) [Althoefer]
Such an approximation can be found in time O(K log(K)/ε²) [Grigoriadis et al]: basically a stochastic FP
Exact solution in time (Auger, Ruette, Teytaud) O(K log(K) k^(2k) + poly(k)) if the solution is k-sparse (good only if k smaller than log(K)/log(log(K))! better?)
Improvements for KxK matrix game: approximations
So, LP & FP are two tools for matrix games.
Linear programming can be adapted to PO games without building the complete matrix (using information sets).
The same for FP variants?
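For reference, the LP that these methods solve or approximate: for a payoff matrix M (payoffs to the row player), the row player's optimal mixed strategy x and the game value v are given by

```latex
% Zero-sum matrix game as a linear program (row player's side):
\begin{aligned}
\max_{x,\,v} \quad & v \\
\text{s.t.} \quad  & \textstyle\sum_i x_i M_{ij} \;\ge\; v \qquad \forall j, \\
                   & \textstyle\sum_i x_i = 1, \qquad x_i \ge 0 .
\end{aligned}
```

The information-set adaptation mentioned above avoids enumerating the exponentially many pure strategies by taking variables indexed by information sets (as in sequence-form LP formulations) rather than by pure strategies.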
Conclusions
There are still natural questions which provide nice decidability problems:
- Madani et al (1 player against random, no observability), extended here to 2 players with no random
- ==> undecidable problems less than the Halting problem?
Solving zero-sum matrix games is still an active area of research:
- Approximate cases
- Sparse case
Open problems
Phantom-Go undecidable? (or another real game...)
Complexity of Go with Chinese rules? (conjectured: PSPACE or EXPTIME; proved: PSPACE-hard, and in EXPSPACE)
More to say about epistemic games (internal state not modified)
Frontier of undecidability in PO games? (100%-halting games: 2P becomes decidable)
Chess with finitely many pieces on an infinite board: decidability of forced mate? (n-move: Brumleve et al, 2012, simulation in Presburger arithmetic (thanks S. Riis :-) )