Distributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming
Dimitri P. Bertsekas
Laboratory for Information and Decision SystemsMassachusetts Institute of Technology
PCO ’13Cambridge, MA
May 2013
Joint Work with Huizhen (Janey) Yu
Summary
Background Review
Asynchronous iterative fixed point methods
Convergence issues
Dynamic programming (DP) applications
New Research on Asynchronous DP Algorithms
Asynchronous policy iteration (asynchronous value and policy updates by multiple processors)
Failure of the “natural” algorithm (Williams-Baird counterexample, 1993)
A radical modification of policy iteration/evaluation: aim to solve an optimal stopping problem instead of solving a linear system
Convergence properties are restored/enhanced
Generalizations and abstractions
References
Current work
D. P. Bertsekas and H. Yu, “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming,” Math. of OR, 2012
H. Yu and D. P. Bertsekas, “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems,” to appear in Annals of OR
D. P. Bertsekas and H. Yu, “Distributed Asynchronous Policy Iteration,” Proc. Allerton Conference, Sept. 2010
More work in progress
Connection to early work on the theory of totally asynchronous algorithms
D. P. Bertsekas, “Distributed Dynamic Programming,” IEEE Transactions on Aut. Control, Vol. AC-27, 1982
D. P. Bertsekas, “Distributed Asynchronous Computation of Fixed Points,” Mathematical Programming, Vol. 27, 1983
D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989
Outline
1 Asynchronous Computation of Fixed Points
2 Value and Policy Iteration for Discounted MDP
3 New Asynchronous Policy Iteration
4 Generalizations
Some History: The First “Real” Implementation of a Distributed Asynchronous Iterative Algorithm
Routing algorithm of the ARPANET (1969)
Based on the Bellman-Ford shortest path algorithm
J(i) = min_{j=0,1,...,n} { a_ij + J(j) },   i = 1, ..., n,
where:
J(i): estimated distance of node i to destination node 0 (J(0) = 0)
a_ij: length of the link connecting i to j
Original ambitious implementation included congestion-dependent a_ij
Failed miserably because of violent oscillations
The algorithm was “stabilized” by making a_ij essentially constant (1974)
Showing convergence of this algorithm and its DP extensions was my entry point into the field (1979)
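The Bellman-Ford iteration above is easy to simulate. The following minimal Python sketch (the four-node graph and the randomized, one-node-at-a-time update order are illustrative assumptions, not the ARPANET algorithm itself) shows the asynchronous iterates settling at the shortest distances:

```python
import random

# Illustrative 4-node graph: a[i][j] = length of the link from node i to node j.
# Node 0 is the destination, so J(0) = 0 by definition.
a = {
    1: {0: 4.0, 2: 1.0},
    2: {0: 2.0, 3: 1.0},
    3: {0: 5.0, 1: 1.0},
}
INF = float("inf")
J = {0: 0.0, 1: INF, 2: INF, 3: INF}

random.seed(0)
for _ in range(100):
    # Asynchronous update: one node at a time, in arbitrary (here random) order.
    i = random.choice([1, 2, 3])
    J[i] = min(a[i][j] + J[j] for j in a[i])   # J(i) = min_j { a_ij + J(j) }

print(J)   # converges to the shortest distances: {0: 0.0, 1: 3.0, 2: 2.0, 3: 4.0}
```

Note that no update order is imposed: starting from J = +∞ (except at the destination), the iterates decrease monotonically, which is exactly the monotonicity that makes the totally asynchronous version converge.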
Early Work on Asynchronous Iterative Algorithms
Initial proposal by Rosenfeld (1967) involved experimentation but no convergence analysis
Sup-norm linear contractive iterations by Chazan and Miranker (1969)
Sup-norm nonlinear contractive iterations by Miellou (1975), Robert (1976), Baudet (1978)
DP algorithms (Bertsekas 1982, 1983): both sup-norm contractive and noncontractive/monotone iterations (e.g., shortest path)
Many subsequent works (nonexpansive, network, and other iterations)
Types of Asynchronism
All asynchronous iterative algorithms proposed up to the early 80s involved “total asynchronism” and either sup-norm contractive or monotone mappings
Subsequent work also involved “partial asynchronism,” which could be applied to additional types of iterations (e.g., gradient methods for optimization)
Distributed “Totally” Asynchronous Framework for Fixed Point Computation
Consider solution of the general fixed point problem J = TJ, or
J(i) = T_i( J(1), ..., J(n) ),   i = 1, ..., n
Network of processors, with a separate processor i for each component J(i), i = 1, ..., n
Asynchronous Fixed Point Algorithm
Processor i updates J(i) at a subset of times T_i ⊂ {0, 1, ...}
Processor i receives (possibly outdated) values J(j) from the other processors j ≠ i
Update of processor i [with “delays” t − τ_ij(t)]:
J^{t+1}(i) = T_i( J^{τ_i1(t)}(1), ..., J^{τ_in(t)}(n) )   if t ∈ T_i,
J^{t+1}(i) = J^t(i)   if t ∉ T_i.
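This update rule can be sketched in simulation. In the following Python sketch, the component maps T_i, the 0.7 update probability, and the bounded-delay model (delays of at most 3 steps) are illustrative assumptions, chosen so that T is a sup-norm contraction:

```python
import random

# Illustrative component maps: T_i(J) = b_i + alpha * (J(1)+...+J(n))/n,
# a sup-norm contraction with modulus alpha < 1; fixed point J* = (3, 4, 5).
n, alpha = 3, 0.5
b = [1.0, 2.0, 3.0]

def T_i(i, Jvals):
    return b[i] + alpha * sum(Jvals) / n

random.seed(1)
history = [[0.0] * n]                 # history[t][i] = J^t(i), starting from J^0 = 0
for t in range(300):
    J_next = list(history[-1])
    for i in range(n):
        if random.random() < 0.7:     # t ∈ T_i: processor i updates at time t
            # Delayed reads: processor i sees component j as of time
            # τ_ij(t) ∈ [t-3, t], i.e., delays t - τ_ij(t) of at most 3 steps.
            tau = [random.randint(max(0, t - 3), t) for _ in range(n)]
            J_next[i] = T_i(i, [history[tau[j]][j] for j in range(n)])
        # else t ∉ T_i: J^{t+1}(i) = J^t(i), so J_next[i] keeps its old value
    history.append(J_next)

print(history[-1])                    # approaches the fixed point [3.0, 4.0, 5.0]
```

Despite the skipped updates and stale reads, the sup-norm distance to J* still contracts over "epochs" in which every processor performs at least one update with sufficiently fresh information, which is the mechanism behind the convergence theorem that follows.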
Distributed Convergence of Fixed Point Iterations
A general theorem for “totally asynchronous” iterations, i.e., the T_i are infinite sets and τ_ij(t) → ∞ as t → ∞ (Bertsekas, 1983)
[Figure: nested sets S(0) ⊃ · · · ⊃ S(k) ⊃ S(k+1) ⊃ · · · shrinking toward the fixed point J*, shown for a two-component example J = (J1, J2), TJ = (T1J, T2J); overlaid value/policy iteration diagrams omitted]
Assume there is a nested sequence of sets S(k+1) ⊂ S(k) such that:
(Synchronous Convergence Condition) We have
TJ ∈ S(k+1),   for all J ∈ S(k),
and the limit points of all sequences {J^k} with J^k ∈ S(k), for all k, are fixed points of T.
(Box Condition) S(k) is a Cartesian product:
S(k) = S_1(k) × · · · × S_n(k)
Then, if J^0 ∈ S(0), every limit point of {J^t} is a fixed point of T.
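For a sup-norm contraction with modulus α and fixed point J*, both conditions hold with the nested boxes S(k) = Π_i [J*(i) − α^k c, J*(i) + α^k c]. Here is a numerical spot-check of this under an illustrative contraction (the specific map and constants are assumptions for the sketch, not from the talk):

```python
import random

# Illustrative sup-norm contraction with modulus alpha and fixed point J*.
alpha, c = 0.5, 10.0
b = [1.0, 2.0, 3.0]
n = len(b)
J_star = [3.0, 4.0, 5.0]              # satisfies J*(i) = b_i + alpha * mean(J*)

def T(J):
    m = sum(J) / n
    return [b[i] + alpha * m for i in range(n)]

def in_S(J, k):
    # S(k) = Π_i [J*(i) - alpha^k c, J*(i) + alpha^k c]: a Cartesian product
    # of intervals, so the Box Condition holds by construction.
    return all(abs(J[i] - J_star[i]) <= alpha**k * c for i in range(n))

# Synchronous Convergence Condition: sample J ∈ S(k) and check TJ ∈ S(k+1).
random.seed(2)
ok = all(
    in_S(T([J_star[i] + random.uniform(-(alpha**k * c), alpha**k * c)
            for i in range(n)]), k + 1)
    for k in range(8) for _ in range(100)
)
print(ok)   # True: the nested boxes satisfy both conditions of the theorem
```

The box structure is what makes asynchrony harmless here: each component interval shrinks independently, so a processor combining components read at different times still produces a point inside the next box.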
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
Distributed Convergence of Fixed Point Iterations

A general theorem for “totally asynchronous” iterations, i.e., the update-time sets Ti are infinite and τij(t) → ∞ as t → ∞ (Bertsekas, 1983):

Assume there is a nested sequence of sets S(k + 1) ⊂ S(k) such that:

(Synchronous Convergence Condition) We have

TJ ∈ S(k + 1), ∀ J ∈ S(k),

and the limit points of all sequences {Jk} with Jk ∈ S(k) for all k are fixed points of T.

(Box Condition) S(k) is a Cartesian product:

S(k) = S1(k) × · · · × Sn(k)

Then, if J0 ∈ S(0), every limit point of {J^t} is a fixed point of T.
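As a concrete illustration of the totally asynchronous model, the sketch below runs value iteration for a small made-up 2-state discounted problem: each component of J is updated at random times using possibly stale values of the other components, yet the iterates still approach J∗. The problem data (g, p, α), the delay model, and all names are invented for the example, not taken from the talk.

```python
import random

# Hypothetical 2-state, 2-control discounted problem (data invented for
# illustration): T(J)_i = min_u [ g(i,u) + alpha * sum_j p_ij(u) J_j ].
alpha = 0.9
g = {0: [1.0, 2.0], 1: [0.5, 3.0]}              # g[i][u]
p = {0: [[0.5, 0.5], [0.9, 0.1]],               # p[i][u][j]
     1: [[0.2, 0.8], [0.7, 0.3]]}

def T_component(i, J):
    """One component of the Bellman operator TJ = min_u (g + alpha * P J)."""
    return min(g[i][u] + alpha * sum(p[i][u][j] * J[j] for j in range(2))
               for u in range(2))

def async_vi(num_updates=5000, max_delay=10, seed=0):
    """Totally asynchronous VI: a random component is updated at each step,
    reading a possibly outdated past iterate (bounded staleness here)."""
    rng = random.Random(seed)
    history = [[0.0, 0.0]]                      # all past iterates J^t
    for _ in range(num_updates):
        J = list(history[-1])
        i = rng.randrange(2)                    # processor i updates...
        t_stale = max(0, len(history) - 1 - rng.randrange(max_delay))
        J[i] = T_component(i, history[t_stale]) # ...with stale information
        history.append(J)
    return history[-1]

# Synchronous reference solution J* = TJ* by plain value iteration.
J_star = [0.0, 0.0]
for _ in range(2000):
    J_star = [T_component(i, J_star) for i in range(2)]

J_async = async_vi()
print(max(abs(J_async[i] - J_star[i]) for i in range(2)))  # sup-norm error, small
```

Since T is a sup-norm contraction with modulus α, the nested sets S(k) = {J | ‖J − J∗‖∞ ≤ α^k B} satisfy both the synchronous convergence and box conditions, which is why the stale updates do no lasting harm.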
Applications of the Theorem
Major contexts where the theorem applies:

T is a sup-norm contraction with fixed point J∗ and modulus α:

S(k) = {J | ‖J − J∗‖∞ ≤ α^k B}, for some scalar B

T is monotone (TJ ≤ TJ′ for J ≤ J′) with fixed point J∗. Moreover,

S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄}

for some J̲ and J̄ with

J̲ ≤ T J̲ ≤ T J̄ ≤ J̄,

and lim_{k→∞} T^k J̲ = lim_{k→∞} T^k J̄ = J∗.

Both of these apply to various DP problems:
The 1st context applies to discounted problems
The 2nd context applies to undiscounted problems (e.g., shortest paths)
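The monotone construction can be checked numerically on a tiny example. The sketch below uses an illustrative 2-state affine monotone operator (all data invented for the example), starts from J̲ = 0 and a large J̄, and verifies that T^k J̲ increases while T^k J̄ decreases, bracketing the common fixed point with a gap shrinking like α^k.

```python
# Monotone-context boxes S(k) = {J : T^k J_lo <= J <= T^k J_hi} for a small
# discounted operator with one control per state (data invented): TJ = g + alpha*P*J.
alpha = 0.9
g = [1.0, 0.5]
P = [[0.5, 0.5], [0.2, 0.8]]

def T(J):
    return [g[i] + alpha * sum(P[i][j] * J[j] for j in range(2)) for i in range(2)]

J_lo = [0.0, 0.0]        # J_lo <= T(J_lo) since g >= 0
J_hi = [100.0, 100.0]    # J_hi >= T(J_hi) for J_hi large enough
for k in range(200):
    new_lo, new_hi = T(J_lo), T(J_hi)
    # Monotone improvement: T^k J_lo increases, T^k J_hi decreases
    # (small tolerance guards against floating-point rounding at convergence).
    assert all(new_lo[i] >= J_lo[i] - 1e-12 and new_hi[i] <= J_hi[i] + 1e-12
               for i in range(2))
    J_lo, J_hi = new_lo, new_hi

# Both endpoints converge to the same fixed point J*.
print(max(J_hi[i] - J_lo[i] for i in range(2)))  # gap ~ 100 * alpha^k
```

Because each box S(k) is an interval, hence a Cartesian product, the Box Condition holds automatically, and monotonicity of T gives the synchronous convergence condition.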
Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations
Applications of the Theorem
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗ J = (J1, J2) TJ = (T1J, T2J)
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗ J = (J1, J2) TJ = (T1J, T2J)
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
Major contexts where the theorem applies:
T is a sup-norm contraction with fixed point J∗ and modulus α:
S(k) = {J | ‖J − J∗‖∞ ≤ αk B}, for some scalar B
T is monotone (TJ ≤ TJ ′ for J ≤ J ′) with fixed point J∗. Moreover,
S(k) = {J | T k J ≤ J ≤ T k J}for some J and J with
J ≤ T J ≤ T J ≤ J,and limk→∞ T k J = limk→∞ T k J = J∗.
Both of these apply to various DP problems:1st context applies to discounted problems2nd context applies to undiscounted problems (e.g., shortest paths)
Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations
Applications of the Theorem
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
[Figure residue from earlier slides: MDP transition diagrams for a fixed policy µ; value iteration J t+1 = TJ t and policy iteration (policy improvement with exact/approximate policy evaluation) converging to J∗ = TJ∗ through the nested sets S(0), S(k), S(k + 1); an example with rµ = 0 and rµ∗ ≈ c/(1 − α); projected-equation figure with subspace S = {Φr | r ∈ ℜ^s} and fixed point ΠT (x∗) = x∗ = Φr∗.]
Major contexts where the theorem applies:
T is a sup-norm contraction with fixed point J∗ and modulus α:
S(k) = {J | ‖J − J∗‖∞ ≤ α^k B}, for some scalar B
T is monotone (TJ ≤ TJ′ for J ≤ J′) with fixed point J∗. Moreover,
S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄} for some J̲ and J̄ with
J̲ ≤ T J̲ ≤ T J̄ ≤ J̄, and lim k→∞ T^k J̲ = lim k→∞ T^k J̄ = J∗.
Both of these apply to various DP problems: the 1st context applies to discounted problems; the 2nd context applies to undiscounted problems (e.g., shortest paths)
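The first (sup-norm contraction) context can be checked numerically. Below is a minimal sketch for a hypothetical 2-state, 2-action discounted MDP — the transition probabilities and costs are made up for illustration: the Bellman operator T contracts with modulus α, so the iterates stay inside the shrinking boxes S(k).

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP (all numbers illustrative).
alpha = 0.9                                 # discount factor = contraction modulus
P = np.array([[[0.8, 0.2], [0.3, 0.7]],     # P[i][u][j]: transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0], [0.5, 3.0]])      # g[i][u]: one-stage costs

def T(J):
    """Bellman operator: (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]."""
    return (g + alpha * P @ J).min(axis=1)

# Compute J* to high accuracy by value iteration.
J_star = np.zeros(2)
for _ in range(2000):
    J_star = T(J_star)

# Verify the nested-box property S(k) = {J : ||J - J*||_inf <= alpha^k B},
# with B = ||J0 - J*||_inf: applying T moves S(k) into S(k+1).
J = np.array([10.0, -10.0])
B = np.max(np.abs(J - J_star))
for k in range(50):
    assert np.max(np.abs(J - J_star)) <= alpha**k * B + 1e-9
    J = T(J)
```

The inner assertion is exactly the membership test J ∈ S(k) from the slide above, for this particular contraction.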
Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations
Applications of the Theorem
[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k + 1) converging to J∗, with value iteration and policy iteration iterates and MDP transition-diagram labels; projected-equation subspace figure with S = {Φr | r ∈ ℜ^s}.]
Asynchronous Convergence Mechanism
[Figure: two-component example J = (J1, J2), TJ = (T1J, T2J); J1 iterations and J2 iterations move within the nested boxes S(0) ⊃ S(k) ⊃ S(k + 1) toward J∗.]
Once we are in S(k), iteration of any one component (say J1) keeps us in the current box
Once all components have been iterated once, we pass into the next box S(k + 1) (permanently)
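The mechanism above can be simulated. A minimal sketch, assuming a made-up linear sup-norm contraction TJ = AJ + b with ‖A‖∞ = α: components are updated one at a time in an arbitrary (here random) order, each infinitely often, and the iterates still converge to the unique fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9

# Made-up linear sup-norm contraction T(J) = A J + b, with ||A||_inf = alpha.
A = rng.random((4, 4))
A = alpha * A / A.sum(axis=1, keepdims=True)   # every row of A sums to alpha
b = rng.random(4)
J_star = np.linalg.solve(np.eye(4) - A, b)     # the unique fixed point of T

# Asynchronous iteration: at each step one "processor" i updates only its
# own component, J(i) := (TJ)(i); the schedule is arbitrary (here random).
J = np.zeros(4)
for _ in range(2000):
    i = rng.integers(4)
    J[i] = A[i] @ J + b[i]
```

Each full "round" in which all four components get updated at least once pushes the iterate from box S(k) into S(k + 1), regardless of the order within the round.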
Finding Fixed Points of Parametric Mappings
Problem: Find a fixed point J∗ of a mapping T : ℜ^n → ℜ^n of the form
(TJ)(i) = minµ∈Mi (TµJ)(i), i = 1, . . . , n,
where µ is a parameter from some set Mi.
Common idea (central in DP policy iteration): Instead of T, we iterate with some Tµ, then change µ once in a while
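A minimal sketch of this idea (iterate m times with the current Tµ, then reset µ to a minimizer — an optimistic/modified policy iteration), again on a hypothetical 2-state, 2-action discounted MDP with made-up numbers:

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[i][u][j]: made-up transition probs
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0], [0.5, 3.0]])     # g[i][u]: made-up one-stage costs
idx = np.arange(2)

def T_mu(J, mu):
    """(Tµ J)(i) = g(i, µ(i)) + alpha * sum_j p_ij(µ(i)) J(j)."""
    return g[idx, mu] + alpha * (P[idx, mu] @ J)

def greedy(J):
    """The µ(i) attaining the min in (TJ)(i) = min_µ (Tµ J)(i)."""
    return (g + alpha * P @ J).argmin(axis=1)

# Iterate with Tµ, changing µ only "once in a while" (every m value updates).
J, mu, m = np.zeros(2), greedy(np.zeros(2)), 5
for _ in range(100):
    for _ in range(m):
        J = T_mu(J, mu)    # m value-iteration steps with the policy fixed
    mu = greedy(J)         # parameter (policy) update
```

With m = 1 this is value iteration J t+1 = TJ t; with m = ∞ it is exact policy iteration, matching the spectrum sketched in the earlier figures.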
[Figure: iterating with Tµ, with occasional µ-updates: J0, J1 = TµJ0, and the sequence Jµ0, Jµ1, Jµ2, Jµ3 converging to J∗ = Jµ∗ = TJ∗; regions Rµ = {r | Tµ(Φr) = T (Φr)} over the subspace S = {Φr | r ∈ ℜ^s}.]
Distributed Asynchronous Parametric Fixed Point Iterations
Partition J and µ into components, J(1), . . . , J(n) and µ(1), . . . , µ(n)
Asynchronous Algorithm
Processor i, at each time, does one of three things:
Does nothing, or
Updates J(i) by setting it to
J(i) := (Tµ(i)J)(i)
or
Updates J(i) and µ(i) by setting
J(i) := (TJ)(i) = minµ∈Mi (TµJ)(i)
and
µ(i) := arg minµ∈Mi (TµJ)(i)
A very chaotic environment!
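A minimal single-machine simulation of these three choices, on a hypothetical 2-state, 2-action discounted MDP with made-up numbers (here Mi is the action set of state i). The random schedule mimics the chaotic environment; the iterates provably stay bounded, but — as the difficulty discussed next shows — convergence of such a scheme is not guaranteed in general.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[i][u][j]: made-up transition probs
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0], [0.5, 3.0]])    # g[i][u]: made-up one-stage costs

J = np.zeros(2)
mu = np.zeros(2, dtype=int)
for _ in range(5000):
    i = rng.integers(2)                   # "processor" i acts at this time
    choice = rng.integers(3)              # one of the three possible actions
    if choice == 0:
        pass                              # does nothing
    elif choice == 1:
        # value update using the locally stored policy component mu(i)
        J[i] = g[i, mu[i]] + alpha * (P[i, mu[i]] @ J)
    else:
        # joint update: J(i) := (TJ)(i), mu(i) := a minimizing parameter
        q = g[i] + alpha * (P[i] @ J)     # q[u] = (T_u J)(i)
        J[i] = q.min()
        mu[i] = q.argmin()
```

Since every update sets J(i) to a cost plus α times a convex combination of the current components, ‖J‖∞ can never exceed max|g|/(1 − α), whatever the schedule.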
Difficulty with Parametric Fixed Point Algorithm
[Figure: iterates labeled Jµ0, Jµ1, Jµ2, Jµ3, Jµ0 — the sequence can return to Jµ0 without reaching J∗; nearby policy values Jµ, Jµ′, Jµ′′ also shown.]
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ�� Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
: Fixed Point of T
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
: Fixed Point of
A More Abstract View of What Follows

We want to find a fixed point J∗ of a mapping T : ℝn → ℝn of the form

(TJ)(i) = min_{µ∈Mi} (TµJ)(i),   i = 1, . . . , n,

where µ is a parameter from some set Mi.

We update J in two ways:

Iterate with T: J → TJ (cf. value iteration/DP), OR

Pick a µ and iterate with Tµ: J → TµJ (cf. policy evaluation/DP)

Difficulty: Tµ has a different fixed point than T ... so iterations with Tµ aim at a target other than J∗

Our key idea (abstractly): Embed both T and Tµ within another (uniform) contraction mapping Fµ that has the same fixed point for all µ

The uniform contraction mapping Fµ operates on the larger space of Q-factors

In the DP context, Fµ is associated with an optimal stopping problem

Most of what follows applies beyond DP

Tµ has a different fixed point than T ... so the target of the iterations keeps changing

A (restrictive) convergence result: If each Tµ is both a sup-norm contraction and monotone, convergence can be shown assuming that Tµ0 J0 ≤ J0

Without the condition Tµ0 J0 ≤ J0, the synchronous algorithm converges but the asynchronous algorithm may oscillate
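The difficulty can be seen numerically on a tiny invented MDP (a sketch; the numbers below are made up): the fixed point Jµ of a fixed policy mapping Tµ, obtained by solving a linear system, generally differs from the fixed point J∗ of T, obtained here by value iteration.

```python
import numpy as np

# Tiny 2-state, 2-control discounted MDP (numbers invented for illustration).
alpha = 0.9
P = np.array([[[0.5, 0.5], [0.9, 0.1]],
              [[0.2, 0.8], [0.4, 0.6]]])   # P[i, u, j] = p_ij(u)
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])                  # expected cost per stage g(i, u)

def T_mu(J, mu):
    """Apply the policy mapping T_mu componentwise."""
    return np.array([g[i, mu[i]] + alpha * P[i, mu[i]] @ J for i in range(2)])

def T(J):
    """(TJ)(i) = min over u of (T_u J)(i)."""
    return np.array([min(g[i, u] + alpha * P[i, u] @ J for u in range(2))
                     for i in range(2)])

# Fixed point of T_mu: solve the linear system J = g_mu + alpha * P_mu J.
mu = np.array([1, 1])
P_mu = np.array([P[i, mu[i]] for i in range(2)])
g_mu = np.array([g[i, mu[i]] for i in range(2)])
J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)

# Fixed point of T: value iteration (a sup-norm contraction of modulus alpha).
J_star = np.zeros(2)
for _ in range(1000):
    J_star = T(J_star)

print(J_mu, J_star)   # the two fixed points differ: iterating with T_mu aims at J_mu, not J*
```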
Addressing the Difficulty: A High-Level View
Assume that each Tµ is a sup-norm contraction
Our approach: Embed both T and Tµ within another (uniform) contraction mapping Gµ that has the same fixed point for all µ
The uniform contraction mapping Gµ operates on a larger space
In the DP context, Gµ is associated with an optimal stopping problem
Most of what follows applies beyond DP
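One concrete way such a uniform contraction can arise, sketched here in the spirit of the optimal-stopping construction (the exact operator used in the talk may differ in details), is a mapping on Q-factors: for a fixed vector J and parameter ν, (F Q)(i, u) = ḡ(i, u) + α ∑_j p_ij(u) min{J(j), Q(j, ν(j))} -- at the next state j one may "stop" and accept J(j), or "continue" with Q(j, ν(j)). Since |min{c, a} − min{c, b}| ≤ |a − b|, this mapping is a sup-norm contraction of modulus α in Q, uniformly in ν; the Python sketch below (invented MDP data) checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 3, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(n, m))   # p_ij(u), invented data
g = rng.random((n, m))                       # expected cost per stage g(i, u)

def F(Q, J, nu):
    """Stopping-problem mapping on Q-factors: at the next state j, either
    'stop' and take J(j) or 'continue' with Q(j, nu(j))."""
    cont = np.minimum(J, Q[np.arange(n), nu])    # min{J(j), Q(j, nu(j))}
    return g + alpha * (P @ cont)                # g(i,u) + alpha sum_j p_ij(u) min{...}

# Uniform sup-norm contraction: for every nu, ||F(Q1) - F(Q2)|| <= alpha ||Q1 - Q2||,
# because |min{c, a} - min{c, b}| <= |a - b| componentwise.
J = rng.random(n)
for _ in range(100):
    nu = rng.integers(m, size=n)
    Q1, Q2 = rng.random((n, m)), rng.random((n, m))
    lhs = np.max(np.abs(F(Q1, J, nu) - F(Q2, J, nu)))
    assert lhs <= alpha * np.max(np.abs(Q1 - Q2)) + 1e-12
print("uniform contraction check passed")
```

The check exercises many random ν with the same J: the contraction modulus α never depends on ν, which is the "uniformity" the slide refers to.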
Outline
1 Asynchronous Computation of Fixed Points
2 Value and Policy Iteration for Discounted MDP
3 New Asynchronous Policy Iteration
4 Generalizations
Dynamic Programming - Markovian Decision Problems (MDP)
System: Controlled Markov chain w/ transition probabilities pij (u)
States: i = 1, . . . , n
Controls: u ∈ U(i)
Cost per stage: g(i, u, j)
Stationary policy: State to control mapping µ; apply µ(i) when at state i
Discounted MDP: Find a policy µ that minimizes the expected value of the infinite horizon cost

∑_{k=0}^{∞} α^k g(i_k, µ(i_k), i_{k+1})

where i_k = state at time k, i_{k+1} = state at time k + 1, and α = discount factor, 0 < α < 1
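To make the cost criterion concrete, here is a small Python check (with invented transition and cost data): a Monte Carlo estimate of ∑_k α^k g(i_k, µ(i_k), i_{k+1}) under a fixed policy agrees with the exact expected cost obtained by solving the policy's linear equation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 3-state chain under a fixed policy mu (all numbers invented).
n, alpha = 3, 0.9
P_mu = rng.dirichlet(np.ones(n), size=n)      # p_ij(mu(i))
g_mu = rng.random((n, n))                     # g(i, mu(i), j)

def simulate_cost(i0, horizon=200):
    """One Monte Carlo sample of sum_k alpha^k g(i_k, mu(i_k), i_{k+1}),
    truncated at a horizon where alpha^k is negligible."""
    i, total = i0, 0.0
    for k in range(horizon):
        j = rng.choice(n, p=P_mu[i])
        total += alpha**k * g_mu[i, j]
        i = j
    return total

# Exact expected cost: J_mu solves J = gbar + alpha * P_mu J,
# where gbar(i) = sum_j p_ij(mu(i)) g(i, mu(i), j).
gbar = np.sum(P_mu * g_mu, axis=1)
J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, gbar)

est = np.mean([simulate_cost(0) for _ in range(2000)])
print(est, J_mu[0])   # the Monte Carlo estimate is close to the exact expected cost
```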
Major DP Results
J*(i) = Optimal cost starting from state i
Jµ(i) = Cost starting from state i using policy µ
Bellman's equation:

J*(i) = min_{u∈U(i)} ∑_{j=1}^n p_ij(u) (g(i, u, j) + αJ*(j)),  i = 1, …, n

A parametric fixed point equation; J* is the unique fixed point.
An optimal policy attains the minimum in the RHS of Bellman's equation for each i.
Bellman's equation for a policy µ:

Jµ(i) = ∑_{j=1}^n p_ij(µ(i)) (g(i, µ(i), j) + αJµ(j)),  i = 1, …, n

It is a linear system of equations with Jµ as its unique solution.
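Since Bellman's equation for a fixed policy is linear, Jµ can be computed exactly with one linear solve. A minimal sketch on a hypothetical 2-state, 2-control discounted MDP (the transition and cost data below are illustrative, not from the talk):

```python
import numpy as np

alpha = 0.9                                  # discount factor, 0 < alpha < 1
# P[i, u, j] = p_ij(u), G[i, u, j] = g(i, u, j)  (illustrative data)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

def evaluate(mu):
    """Solve J_mu = g_mu + alpha * P_mu J_mu, a linear system in J_mu."""
    n = P.shape[0]
    Pmu = P[np.arange(n), mu]                              # rows p_ij(mu(i))
    gmu = np.einsum('ij,ij->i', Pmu, G[np.arange(n), mu])  # expected stage cost
    return np.linalg.solve(np.eye(n) - alpha * Pmu, gmu)

mu = [0, 1]
J = evaluate(mu)
# J satisfies Bellman's equation for mu: J(i) = sum_j p_ij(mu(i))(g(i,mu(i),j) + alpha J(j))
residual = np.einsum('ij,ij->i', P[np.arange(2), mu],
                     G[np.arange(2), mu] + alpha * J) - J
```

The residual of Bellman's equation for µ is zero up to roundoff, confirming Jµ is the unique solution of the linear system.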
Shorthand Notation - Fixed Point View
Denote by T and Tµ the mappings that map J ∈ ℜ^n to the vectors TJ and TµJ with components

(TJ)(i) := min_{u∈U(i)} ∑_{j=1}^n p_ij(u) (g(i, u, j) + αJ(j)),  i = 1, …, n,

and

(TµJ)(i) := ∑_{j=1}^n p_ij(µ(i)) (g(i, µ(i), j) + αJ(j)),  i = 1, …, n

Bellman's equations are written as

J* = TJ* = min_µ TµJ*,  Jµ = TµJµ

Key structure: T and Tµ are sup-norm contractions with modulus α,

‖TJ − TJ′‖∞ = max_{i=1,…,n} |(TJ)(i) − (TJ′)(i)| ≤ α max_{i=1,…,n} |J(i) − J′(i)| = α‖J − J′‖∞
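The contraction inequality above can be checked numerically: for random pairs J, J′, the sup-norm distance never grows by more than the factor α under T. A small sketch on illustrative MDP data (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
# P[i, u, j] = p_ij(u), G[i, u, j] = g(i, u, j)  (illustrative data)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

def T(J):
    """(TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha * J(j))."""
    return np.einsum('iuj,iuj->iu', P, G + alpha * J).min(axis=1)

# Check ||TJ - TJ'||_inf <= alpha * ||J - J'||_inf on 1000 random pairs
ok = all(
    np.max(np.abs(T(J) - T(Jp))) <= alpha * np.max(np.abs(J - Jp)) + 1e-12
    for J, Jp in (rng.normal(size=(2, 2)) * 10 for _ in range(1000))
)
```

The inequality holds for every sampled pair, as the contraction property guarantees.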
Major (Synchronous) Methods for Finding Fixed Point of T
Value iteration (generic fixed point method): Start with any J0 and iterate by

J^{t+1} = TJ^t

Policy iteration (special method for T of the form T = min_µ Tµ): Start with any J0 and µ0. Given J^t and µ^t, iterate by:
Policy evaluation: J^{t+1} = (T_{µ^t})^m J^t (m applications of T_{µ^t} to J^t; m = ∞ is possible)
Policy improvement: µ^{t+1} attains the min in TJ^{t+1} (i.e., T_{µ^{t+1}} J^{t+1} = TJ^{t+1})
Both methods converge to J*:
Value iteration, thanks to the contraction property of T
Policy iteration, thanks to the contraction and monotonicity of T and Tµ
Typically, policy iteration (with a reasonable choice of m) is more efficient, because applying Tµ is cheaper than applying T
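Both methods are easy to sketch on the same hypothetical 2-state MDP (illustrative data, not from the talk): value iteration applies T directly, while optimistic policy iteration alternates m sweeps of Tµ with a policy improvement, and both reach the same fixed point.

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])
n = P.shape[0]

def bellman(J):
    Q = np.einsum('iuj,iuj->iu', P, G + alpha * J)  # Q(i,u)
    return Q.min(axis=1), Q.argmin(axis=1)          # TJ and a minimizing policy

# Value iteration: J^{t+1} = T J^t
J_vi = np.zeros(n)
for _ in range(400):
    J_vi, _ = bellman(J_vi)

# Optimistic policy iteration: m = 5 evaluation sweeps per improvement
J_pi = np.zeros(n)
for _ in range(80):
    _, mu = bellman(J_pi)                           # policy improvement
    for _ in range(5):                              # m applications of T_mu
        J_pi = np.einsum('ij,ij->i', P[np.arange(n), mu],
                         G[np.arange(n), mu] + alpha * J_pi)
```

Both iterates agree with each other and satisfy J = TJ to numerical precision.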
Graphical Interpretation of Policy Iteration
[Figure: the maps TJ = min_µ TµJ and Tµ1J plotted against the 45-degree line. Value iteration generates J^{t+1} = TJ^t, converging to the fixed point J* = TJ*. Policy iteration alternates policy improvement (J0 ↦ (J1, µ1)) with exact or approximate policy evaluation (e.g., J2 = T²_{µ1} J1, exact evaluation terminating at Jµ1 = Tµ1 Jµ1), each cycle moving toward the fixed point.]
Failure of Asynchronous Policy Iteration: W-B Example
Counterexample by Williams and Baird (1993)
Deterministic discounted MDP with 6 states arranged in a circle
2 controls available in half the states, 1 control available in the other half
Policy evaluations and improvements are one state at a time, with no "delays"
A cycle of 15 iterations is constructed that repeats the initial conditions
Outline
1 Asynchronous Computation of Fixed Points
2 Value and Policy Iteration for Discounted MDP
3 New Asynchronous Policy Iteration
4 Generalizations
Rectifying the Difficulty - Q-Factors
Q-factors Q(i, u) are functions of state-control pairs (i, u)
The optimal Q-factors are given by

Q*(i, u) = ∑_{j=1}^n p_ij(u) (g(i, u, j) + αJ*(j))

Q*(i, u): Cost of starting at i, using u first, then using an optimal policy.
Bellman's equation J* = TJ* is written as

J*(i) = min_{u∈U(i)} Q*(i, u)

These two equations constitute a fixed point equation in (Q, J)
For a fixed policy µ, the corresponding fixed point equation in (Qµ, Jµ) is

Qµ(i, u) = ∑_{j=1}^n p_ij(u) (g(i, u, j) + αQµ(j, µ(j))),  Jµ(i) = Qµ(i, µ(i))
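Since the two equations above define Q* as a fixed point, Q* can be computed by iterating them jointly (Q-value iteration). A sketch on the same illustrative data (hypothetical, not from the talk):

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

Q = np.zeros((2, 2))
for _ in range(400):
    J = Q.min(axis=1)                               # J(j) = min_u Q(j, u)
    Q = np.einsum('iuj,iuj->iu', P, G + alpha * J)  # Q(i,u) = sum_j p_ij(u)(g + alpha J(j))

J_star = Q.min(axis=1)
# J_star should satisfy Bellman's equation J* = TJ*
residual = np.einsum('iuj,iuj->iu', P, G + alpha * J_star).min(axis=1) - J_star
```

Taking J(i) = min_u Q(i, u) inside the iteration is exactly the substitution that turns the pair of equations into a single fixed point iteration in Q.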
Sup-Norm Uniform Contraction Property
Consider Q-factors Q(i, u) and costs J(i). For any µ, define the mapping

(Q, J) ↦ (Fµ(Q, J), Mµ(Q, J))

where

Fµ(Q, J)(i, u) := ∑_{j=1}^n p_ij(u) (g(i, u, j) + α min{J(j), Q(j, µ(j))}),
Mµ(Q, J)(i) := min_{u∈U(i)} Fµ(Q, J)(i, u)

Key fact: A sup-norm contraction with common fixed point (Q*, J*) for all µ:

max{‖Fµ(Q, J) − Q*‖∞, ‖Mµ(Q, J) − J*‖∞} ≤ α max{‖Q − Q*‖∞, ‖J − J*‖∞}

The asynchronous iteration for the fixed point problem

Q := Fµ(Q, J),  J := Mµ(Q, J),  µ := arbitrary

is convergent.
Even though we operate with different mappings Fµ and Mµ, they all have a common fixed point.
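The uniform contraction can also be checked numerically: compute (Q*, J*) once by Q-value iteration, then verify the key inequality for random µ, Q, and J. A sketch with the same illustrative data (hypothetical, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

def F(Q, J, mu):
    """F_mu(Q,J)(i,u) = sum_j p_ij(u)(g(i,u,j) + alpha*min{J(j), Q(j,mu(j))})."""
    W = np.minimum(J, Q[np.arange(2), mu])
    return np.einsum('iuj,iuj->iu', P, G + alpha * W)

def M(Q, J, mu):
    return F(Q, J, mu).min(axis=1)

# Common fixed point (Q*, J*) via Q-value iteration
Qs = np.zeros((2, 2))
for _ in range(400):
    Qs = np.einsum('iuj,iuj->iu', P, G + alpha * Qs.min(axis=1))
Js = Qs.min(axis=1)

ok = True
for _ in range(500):
    mu = rng.integers(0, 2, 2)                      # arbitrary policy
    Q = Qs + rng.normal(size=(2, 2)) * 5
    J = Js + rng.normal(size=2) * 5
    lhs = max(np.abs(F(Q, J, mu) - Qs).max(), np.abs(M(Q, J, mu) - Js).max())
    rhs = alpha * max(np.abs(Q - Qs).max(), np.abs(J - Js).max())
    ok &= lhs <= rhs + 1e-10
```

Note that min{J*(j), Q*(j, µ(j))} = J*(j) for every µ, which is why (Q*, J*) is a fixed point of Fµ and Mµ for all µ simultaneously.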
Asynchronous Policy Iteration: A More Detailed View
Asynchronous distributed policy iteration algorithm: maintains J^t, µ^t, and V^t, where

V^t(i) = Q^t(i, µ^t(i)) ≈ Q-factors of the current policy

Processor i at time t does one of three things:
Does nothing
Or operates with F_{µ^t} at i:

V^{t+1}(i) = ∑_{j=1}^n p_ij(µ^t(i)) (g(i, µ^t(i), j) + α min{J^t(j), V^t(j)})

and leaves J^t(i), µ^t(i) unchanged.
Or operates with M_{µ^t} at i: sets

J^{t+1}(i) = V^{t+1}(i) = min_{u∈U(i)} ∑_{j=1}^n p_ij(u) (g(i, u, j) + α min{J^t(j), V^t(j)})

and sets µ^{t+1}(i) to a control that attains the minimum.
Convergence follows from the asynchronous convergence theorem.
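A minimal single-thread simulation of this algorithm: at each step a processor (state) and an update type are chosen at random, mimicking asynchronous scheduling (illustrative data and hypothetical scheduling, not from the talk).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])
n, m = 2, 2

J = rng.uniform(-5.0, 5.0, n)
V = rng.uniform(-5.0, 5.0, n)
mu = rng.integers(0, m, n)

for _ in range(20000):
    i = int(rng.integers(n))
    W = np.minimum(J, V)                            # min{J^t, V^t}
    if rng.random() < 0.5:                          # policy evaluation at i (F_mu)
        V[i] = P[i, mu[i]] @ (G[i, mu[i]] + alpha * W)
    else:                                           # policy improvement at i (M_mu)
        q = np.array([P[i, u] @ (G[i, u] + alpha * W) for u in range(m)])
        mu[i] = int(q.argmin())
        J[i] = V[i] = q.min()

# Compare with J* from synchronous value iteration
J_star = np.zeros(n)
for _ in range(400):
    J_star = np.einsum('iuj,iuj->iu', P, G + alpha * J_star).min(axis=1)
```

Despite the arbitrary interleaving of evaluations and improvements, J converges to J*, as the uniform sup-norm contraction guarantees.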
Williams and Baird Counterexample
Generalized Mappings T and Tµ

The preceding analysis uses only the contraction property of the discounted MDP (not monotonicity or the probabilistic structure).

Abstract mappings T and Tµ: introduce a mapping H(i, u, J) and denote

    (TJ)(i) = min_{u∈U(i)} H(i, u, J),    (TµJ)(i) = H(i, µ(i), J),

i.e., TJ = min_µ TµJ, where the min is taken separately for each component.

Assume that for all i and u ∈ U(i),

    |H(i, u, J) − H(i, u, J′)| ≤ α ‖J − J′‖∞

Asynchronous algorithm: at time t, processor i does one of three things:

Does nothing.

Or does a "policy evaluation" at i: sets

    V^{t+1}(i) = H(i, µ^t(i), min{J^t, V^t})

and leaves J^t(i), µ^t(i) unchanged.

Or does a "policy improvement" at i: sets

    J^{t+1}(i) = V^{t+1}(i) = min_{u∈U(i)} H(i, u, min{J^t, V^t})

and sets µ^{t+1}(i) to a u that attains the minimum.
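To illustrate the generality, here is a hedged sketch that runs the same asynchronous scheme with an abstract H; the particular choice H(i, u, J) = max_j [ g(i, u, j) + αJ(j) ] is a sequential minimax mapping (the adversary picks the worst successor j), and all problem data are hypothetical. This H satisfies the sup-norm contraction assumption above, since a max of terms shifted by αJ(j) moves by at most α‖J − J′‖∞.

```python
# Hedged sketch: asynchronous policy iteration with an abstract H,
# here a minimax H(i,u,J) = max_j (g(i,u,j) + alpha*J(j)).
# All data below are illustrative, not from the talk.
import numpy as np

n, alpha = 4, 0.8
rng = np.random.default_rng(1)
U = [list(range(2)) for _ in range(n)]     # two controls per state
g = rng.uniform(0.0, 1.0, size=(n, 2, n))  # g[i, u, j]

def H(i, u, J):
    # max over the adversary's successor choice: an alpha-contraction in J
    return float(np.max(g[i, u] + alpha * J))

def T(J):
    return np.array([min(H(i, u, J) for u in U[i]) for i in range(n)])

# Reference fixed point J* = TJ* by successive approximation
Jstar = np.zeros(n)
for _ in range(500):
    Jstar = T(Jstar)

# Asynchronous algorithm with the min{J, V} safeguard
J = np.zeros(n)
V = np.zeros(n)
mu = [0] * n
for t in range(30000):
    i = int(rng.integers(n))
    W = np.minimum(J, V)
    if rng.random() < 0.5:                 # "policy evaluation" at i
        V[i] = H(i, mu[i], W)
    else:                                  # "policy improvement" at i
        vals = [H(i, u, W) for u in U[i]]
        mu[i] = int(np.argmin(vals))
        J[i] = V[i] = vals[mu[i]]

print(float(np.max(np.abs(J - Jstar))))    # sup-norm error vs. the fixed point
```

Only the contraction assumption on H is used; swapping in a different H (expectation, max, or a mixture) leaves the driver loop unchanged.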
DP Applications with Generalized T and Tµ

DP models beyond discounted, with standard policy evaluation:

Optimistic/modified policy iteration for semi-Markov and minimax discounted problems
Stochastic shortest path problems
Semi-Markov decision problems
Stochastic zero-sum games
Sequential minimax problems

Multi-agent aggregation:

Each agent updates costs at all states within a subset
Each agent uses detailed costs for the local states and aggregate costs for other states, as communicated by other agents
Fixed Points of Concave Sup-Norm Contractions
[Figure: a concave mapping TJ, shown as the minimum of linear mappings TµJ, together with the linearization process: starting from J0, iterating the current linearization gives J1 = TµJ0 and leads toward Jµ; successive µ-updates (relinearizations) lead to the fixed point J* = TJ*.]
Find a fixed point of a mapping T : ℝⁿ → ℝⁿ whose components Ti(·) : ℝⁿ → ℝ are concave sup-norm contractions.

T is the minimum of a collection of linear mappings Tµ, obtained from the concave conjugate functions of the concave functions Ti.

An asynchronous parametric fixed point algorithm can be used:

Linearize at the current J
Iterate using the linearized mapping
Update the linearization

All of the above are done asynchronously.
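The steps above can be sketched numerically under illustrative assumptions: take each component of T to be the pointwise minimum of a few affine sup-norm contractions (hence concave), and alternate a few iterations of the current linearization with a relinearization step. The matrices and vectors below are hypothetical, scaled so that each linear mapping has sup-norm modulus α; this is a synchronous sketch of the parametric idea, not the talk's asynchronous implementation.

```python
# Hedged sketch: T is the pointwise minimum of affine contractions
# T_mu(J) = A[mu] @ J + b[mu]; the parametric algorithm iterates the
# current linearization and periodically relinearizes.
import numpy as np

n, m, alpha = 5, 3, 0.9
rng = np.random.default_rng(2)
A = rng.uniform(0.0, 1.0, size=(m, n, n))
A *= alpha / A.sum(axis=2, keepdims=True)   # each row sums to alpha: ||A_mu||_inf = alpha
b = rng.uniform(0.0, 1.0, size=(m, n))

def T(J):
    # concave, since it is a pointwise min over a linear family
    return np.min(A @ J + b, axis=0)

# Reference fixed point J* = TJ* by successive approximation
Jstar = np.zeros(n)
for _ in range(1000):
    Jstar = T(Jstar)

# Parametric algorithm: linearize (pick the minimizing mu per component),
# iterate the linearized mapping a few times, then relinearize
J = np.zeros(n)
for _ in range(100):
    mu = np.argmin(A @ J + b, axis=0)       # linearization at the current J
    Amu = A[mu, np.arange(n), :]            # row i taken from mapping mu[i]
    bmu = b[mu, np.arange(n)]
    for _ in range(5):                      # iterate the linearized mapping
        J = Amu @ J + bmu

print(float(np.max(np.abs(J - Jstar))))     # sup-norm error vs. the fixed point
```

Because A is nonnegative, each Tµ is also monotone here, which is what lets this synchronous optimistic scheme converge; the talk's point is that the min{J, V} device restores convergence even when the updates are interleaved asynchronously.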
Concluding Remarks

Asynchronous policy iteration has fragile convergence properties.

We have provided a new approach that corrects the difficulties.

Key idea: embed the problem into one that involves a uniform sup-norm contraction.

Extensions to generalized DP models involving:

Other DP models (e.g., stochastic shortest path problems, games)
Fixed point problems for non-DP mappings of the form T = min_{µ∈M} Tµ, such as concave sup-norm contractions
Thank You!