
Sections: Asynchronous Computation of Fixed Points | Value and Policy Iteration for Discounted MDP | New Asynchronous Policy Iteration | Generalizations

Distributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming

Dimitri P. Bertsekas

Laboratory for Information and Decision Systems
Massachusetts Institute of Technology

PCO ’13, Cambridge, MA

May 2013

Joint Work with Huizhen (Janey) Yu


Summary

Background Review

Asynchronous iterative fixed point methods

Convergence issues

Dynamic programming (DP) applications

New Research on Asynchronous DP Algorithms

Asynchronous policy iteration (asynchronous value and policy updates by multiple processors)

Failure of the “natural” algorithm (Williams-Baird counterexample, 1993)

A radical modification of policy iteration/evaluation: aim to solve an optimal stopping problem instead of solving a linear system

Convergence properties are restored/enhanced

Generalizations and abstractions


References

Current work

D. P. Bertsekas and H. Yu, “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming,” Math. of OR, 2012

H. Yu and D. P. Bertsekas, “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems,” to appear in Annals of OR

D. P. Bertsekas and H. Yu, “Distributed Asynchronous Policy Iteration,” Proc. Allerton Conference, Sept. 2010

More works in progress

Connection to early work on the theory of totally asynchronous algorithms

D. P. Bertsekas, “Distributed Dynamic Programming,” IEEE Transactions on Automatic Control, Vol. AC-27, 1982

D. P. Bertsekas, “Distributed Asynchronous Computation of Fixed Points,” Mathematical Programming, Vol. 27, 1983

D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989


Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations


Some History: The First “Real” Implementation of a Distributed Asynchronous Iterative Algorithm

Routing algorithm of the ARPANET (1969)

Based on the Bellman-Ford shortest path algorithm

J(i) = min_{j = 0, 1, ..., n} { a_ij + J(j) },   i = 1, ..., n,

where:
J(i): estimated distance of node i to destination node 0 (J(0) = 0)
a_ij: length of the link connecting i to j

Original ambitious implementation included congestion-dependent a_ij

Failed miserably because of violent oscillations

The algorithm was “stabilized” by making a_ij essentially constant (1974)

Showing convergence of this algorithm and its DP extensions was my entry point into the field (1979)
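The Bellman-Ford update above lends itself to exactly this kind of chaotic, distributed execution: each node repeatedly re-solves its own equation with whatever neighbor estimates it currently holds. A minimal Python sketch on a made-up four-node graph (the link lengths a_ij below are illustrative, not from the talk):

```python
import random

INF = float("inf")

# Hypothetical link lengths a[i][j]; node 0 is the destination.
a = {
    1: {0: 4.0, 2: 1.0},
    2: {0: 2.0, 1: 1.0, 3: 5.0},
    3: {0: INF, 2: 1.0},
}
nodes = [1, 2, 3]

# Distance estimates; J(0) = 0 by definition.
J = {0: 0.0, 1: INF, 2: INF, 3: INF}

# Asynchronous sweeps: each pick of i plays the role of processor i
# updating J(i) = min_j { a_ij + J(j) } from the currently available estimates.
random.seed(0)
for _ in range(200):
    i = random.choice(nodes)
    J[i] = min(a[i][j] + J[j] for j in a[i])

print(J)  # converges to the true shortest distances to node 0
```

Because the shortest path (Bellman) operator is monotone and the estimates are initialized above the solution, the order of the updates does not affect the limit; this is the kind of iteration covered by the totally asynchronous theory below.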


Early Work on Asynchronous Iterative Algorithms

Initial proposal by Rosenfeld (1967) involved experimentation but no convergence analysis

Sup-norm linear contractive iterations by Chazan and Miranker (1969)

Sup-norm nonlinear contractive iterations by Miellou (1975), Robert (1976), Baudet (1978)

DP algorithms (Bertsekas 1982, 1983): both sup-norm contractive and noncontractive/monotone iterations (e.g., shortest path)

Many subsequent works (nonexpansive, network, and other iterations)

Types of Asynchronism

All asynchronous iterative algorithms proposed up to the early 80s involved “total asynchronism” and either sup-norm contractive or monotone mappings

Subsequent work also involved “partial asynchronism,” which could be applied to additional types of iterations (e.g., gradient methods for optimization)


Distributed “Totally” Asynchronous Framework for Fixed Point Computation

Consider solution of the general fixed point problem J = TJ, or componentwise

J(i) = T_i( J(1), ..., J(n) ),   i = 1, ..., n

Network of processors, with a separate processor i for each component J(i), i = 1, ..., n

Asynchronous Fixed Point Algorithm

Processor i updates J(i) at a subset of times T_i ⊂ {0, 1, ...}

Processor i receives (possibly outdated) values J(j) from other processors j ≠ i

Update of processor i [with “delays” t − τ_ij(t)]:

J^{t+1}(i) = T_i( J^{τ_i1(t)}(1), ..., J^{τ_in(t)}(n) )   if t ∈ T_i,
J^{t+1}(i) = J^t(i)                                       if t ∉ T_i.
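As a concrete illustration of this update rule (an assumed example, not from the talk), the sketch below runs a totally asynchronous iteration for a 2-dimensional sup-norm contraction T(J) = AJ + b with ||A||_∞ = 0.7, with random update times T_i and random bounded delays t − τ_ij(t) ≤ 3:

```python
import random

# Assumed example: T(J) = A J + b, a sup-norm contraction (||A||_inf = 0.7).
A = [[0.5, 0.2],
     [0.1, 0.6]]
b = [1.0, 2.0]
n = 2

def T_i(i, J):
    """Component mapping T_i(J(1), ..., J(n))."""
    return sum(A[i][j] * J[j] for j in range(n)) + b[i]

random.seed(1)
history = [[0.0] * n]                  # history[t][i] = J^t(i)
for t in range(1000):
    new = list(history[-1])
    for i in range(n):
        if random.random() < 0.5:      # t in T_i: processor i updates
            # possibly outdated components J^{tau_ij(t)}(j), tau_ij(t) >= t - 3
            delayed = [history[random.randint(max(0, t - 3), t)][j]
                       for j in range(n)]
            new[i] = T_i(i, delayed)
        # otherwise J^{t+1}(i) = J^t(i)
    history.append(new)

# The unique fixed point of J = A J + b is J* = (40/9, 55/9).
print(history[-1])
```

Despite the stale information, the iterates approach J*; for sup-norm contractions this holds even under total asynchronism (unbounded delays), as long as each processor updates infinitely often and old information is eventually purged.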


Distributed Convergence of Fixed Point Iterations

A general theorem for “totally asynchronous” iterations, i.e., the T_i are infinite sets and τ_ij(t) → ∞ as t → ∞ (Bertsekas, 1983)

[Figure: nested sets S(0) ⊃ ··· ⊃ S(k) ⊃ S(k + 1) ⊃ ··· shrinking to the fixed point J*; each S(k) is a box, e.g., S(0) = S_1(0) × S_2(0), and the iterate J = (J_1, J_2) is mapped by TJ = (T_1 J, T_2 J)]

Assume there is a nested sequence of sets S(k + 1) ⊂ S(k) such that:

(Synchronous Convergence Condition) We have

TJ ∈ S(k + 1),   ∀ J ∈ S(k),

and the limit points of all sequences {J_k} with J_k ∈ S(k) for all k are fixed points of T.

(Box Condition) Each S(k) is a Cartesian product:

S(k) = S_1(k) × ··· × S_n(k)

Then, if J^0 ∈ S(0), every limit point of {J^t} is a fixed point of T.
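For a sup-norm contraction the two conditions are easy to exhibit: sup-norm balls around the fixed point are Cartesian products of intervals (Box Condition), and the contraction property makes T shrink them (Synchronous Convergence Condition). A small numerical check on an assumed 2×2 linear example (not from the talk):

```python
import random

# Assumed example: T(J) = A J + b with contraction modulus rho = ||A||_inf = 0.7.
A = [[0.5, 0.2], [0.1, 0.6]]
b = [1.0, 2.0]
rho = 0.7
J_star = [40 / 9, 55 / 9]              # fixed point of J = A J + b

def T(J):
    return [sum(A[i][j] * J[j] for j in range(2)) + b[i] for i in range(2)]

# S(k) = { J : |J(i) - J*(i)| <= c rho^k for all i } is a Cartesian product of
# intervals, so the Box Condition holds by construction.
c = 10.0
random.seed(2)
for k in range(5):
    r = c * rho ** k
    for _ in range(1000):
        J = [J_star[i] + random.uniform(-r, r) for i in range(2)]
        TJ = T(J)
        # Synchronous Convergence Condition: T maps S(k) into S(k + 1), since
        # ||TJ - J*||_inf = ||A (J - J*)||_inf <= rho ||J - J*||_inf.
        assert all(abs(TJ[i] - J_star[i]) <= c * rho ** (k + 1) + 1e-12
                   for i in range(2))
print("T maps sampled points of S(k) into S(k + 1) for k = 0, ..., 4")
```

The box structure is what makes asynchrony harmless here: a processor that updates one component using stale (but in-S(k)) information still lands its component inside the k+1st box.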


Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→

(J1, µ1) TJ = minµ TµJ

Policy Improvement Exact Policy Evaluation (Exact if m = ∞)

Approx. Policy Evaluation J t �→ (J t+1, µt+1)

Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗

rµ∗ ≈ c

1 − α,

c

αrµ = 0

1 2

k Stages j1 j2 jk

rµ1 rµ2 rµ3 rµk+3

Rµ1 Rµ2 Rµ3 Rµk+3

Controllable State Components Post-Decision States

State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|

j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)

j0 j1 jk jk+1 i0 i1 ik ik+1

u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)

(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)

f�2,Xk

(−λ)

x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k

1

S1(0) S2(0)

N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0

x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}

r0 + D−1Ra(C)

Subspace S = {Φr | r ∈ �s} Set S 0

Limit of {rk}x = Φr

1

S1(0) S2(0)

N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0

x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}

r0 + D−1Ra(C)

Subspace S = {Φr | r ∈ �s} Set S 0

Limit of {rk}x = Φr

1

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Distributed Convergence of Fixed Point Iterations

A general theorem for “totally asynchronous” iterations, i.e., the sets Ti are infinite and τij(t) → ∞ as t → ∞ (Bertsekas, 1983):

Assume there is a nested sequence of sets S(k + 1) ⊂ S(k) such that:

(Synchronous Convergence Condition) We have

TJ ∈ S(k + 1), ∀ J ∈ S(k),

and the limit points of all sequences {Jk} with Jk ∈ S(k), for all k, are fixed points of T.

(Box Condition) Each S(k) is a Cartesian product:

S(k) = S1(k) × · · · × Sn(k).

Then, if J0 ∈ S(0), every limit point of {J t} is a fixed point of T.
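As a concrete illustration of such an iteration, the sketch below runs a totally asynchronous fixed-point iteration for a small linear sup-norm contraction T(J) = AJ + b: at each step a randomly chosen component i is updated from possibly outdated values of the other components, yet the iterates still approach the fixed point, as the theorem predicts. The matrix A, vector b, and delay bound are made-up illustration data, not from the talk.

```python
import random

# Hypothetical 3-dimensional linear mapping T(J) = A @ J + b; its sup-norm
# modulus is the maximum absolute row sum of A (here 0.4 < 1), so T is a
# sup-norm contraction with a unique fixed point J* solving J = AJ + b.
A = [[0.2, 0.1, 0.0],
     [0.0, 0.3, 0.1],
     [0.1, 0.0, 0.2]]
b = [1.0, 2.0, 3.0]
n = 3

def T_i(J, i):
    """i-th component of the contraction mapping T."""
    return sum(A[i][k] * J[k] for k in range(n)) + b[i]

def async_fixed_point(iters=2000, max_delay=5, seed=0):
    """Asynchronous iteration: at each step one component i is updated
    using possibly outdated (stale) values of the other components."""
    rng = random.Random(seed)
    history = [[0.0] * n]            # past iterates, for stale reads
    J = list(history[-1])
    for _ in range(iters):
        i = rng.randrange(n)         # "processor" i wakes up
        # each component is read from a (bounded-)stale past iterate
        stale = [history[max(0, len(history) - 1 - rng.randrange(max_delay))][k]
                 for k in range(n)]
        J = list(J)
        J[i] = T_i(stale, i)
        history.append(J)
    return J

# synchronous reference solution via plain fixed-point iteration
J_star = [0.0] * n
for _ in range(500):
    J_star = [T_i(J_star, i) for i in range(n)]

J_async = async_fixed_point()
print(max(abs(J_async[i] - J_star[i]) for i in range(n)))
```

Here the nested sets S(k) of the theorem are sup-norm balls around J∗ that shrink by the modulus at each sweep; the bounded random delays satisfy the asynchronism conditions, so the stale updates cannot prevent convergence.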


Applications of the Theorem


Major contexts where the theorem applies:

T is a sup-norm contraction with fixed point J∗ and modulus α:

S(k) = {J | ‖J − J∗‖∞ ≤ α^k B}, for some scalar B.

T is monotone (TJ ≤ TJ′ for J ≤ J′) with fixed point J∗, and

S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄}

for some J̲ and J̄ with

J̲ ≤ T J̲ ≤ T J̄ ≤ J̄, and lim k→∞ T^k J̲ = lim k→∞ T^k J̄ = J∗.

Both of these apply to various DP problems: the 1st context applies to discounted problems; the 2nd context applies to undiscounted problems (e.g., shortest paths).
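The second (monotone) context can be illustrated with a small sketch: the discounted Bellman operator T is monotone, so iterating it from a lower bound J̲ with J̲ ≤ T J̲ and an upper bound J̄ with T J̄ ≤ J̄ squeezes the iterates toward J∗ from both sides. The two-state costs and transitions below are made-up illustration data.

```python
# Hypothetical 2-state, 2-action discounted MDP with deterministic
# transitions; T is monotone, so T^k Jlow increases and T^k Jhigh
# decreases, and both sequences converge to the fixed point J*.
alpha = 0.9                        # discount factor (assumed)
cost = [[1.0, 2.0], [0.5, 1.5]]    # cost[s][a]
nxt  = [[0, 1], [1, 0]]            # next state nxt[s][a]

def T(J):
    """Bellman operator: (TJ)(s) = min_a [ cost(s,a) + alpha * J(next(s,a)) ]."""
    return [min(cost[s][a] + alpha * J[nxt[s][a]] for a in (0, 1))
            for s in (0, 1)]

Jlow  = [0.0, 0.0]        # Jlow <= T Jlow, since all costs are nonnegative
Jhigh = [100.0, 100.0]    # large enough that T Jhigh <= Jhigh
for _ in range(200):
    Jlow, Jhigh = T(Jlow), T(Jhigh)

print(Jlow, Jhigh)  # both approach the same fixed point J* ≈ [6.5, 5.0]
```

The sets S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄} are then the shrinking order intervals between the two sequences, which is exactly the structure the theorem asks for.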


Limit of {rk}x = Φr

1

Major contexts where the theorem applies:

T is a sup-norm contraction with fixed point J∗ and modulus α:

S(k) = {J | ‖J − J∗‖∞ ≤ α^k B}, for some scalar B

T is monotone (TJ ≤ TJ′ for J ≤ J′) with fixed point J∗. Moreover,

S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄} for some J̲ and J̄ with

J̲ ≤ T J̲ ≤ T J̄ ≤ J̄, and lim k→∞ T^k J̲ = lim k→∞ T^k J̄ = J∗.

Both of these apply to various DP problems:
The 1st context applies to discounted problems.
The 2nd context applies to undiscounted problems (e.g., shortest paths).
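As a minimal sketch of the first context (a hypothetical 4-state example, not from the talk): the linear mapping TJ = g + αPJ with row-stochastic P is a sup-norm contraction with modulus α, so S(k) = {J | ‖J − J∗‖∞ ≤ α^k B} gives the nested sets, and even totally asynchronous single-component updates drive the error to zero:

```python
import numpy as np

# Assumed toy data: T J = g + alpha * P @ J is a sup-norm contraction
# with modulus alpha, so its fixed point J* solves (I - alpha P) J* = g.
rng = np.random.default_rng(0)
n, alpha = 4, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # row-stochastic
g = rng.random(n)
J_star = np.linalg.solve(np.eye(n) - alpha * P, g)         # fixed point of T

J = np.zeros(n)
for t in range(2000):
    i = rng.integers(n)                  # one "processor" updates one component
    J[i] = g[i] + alpha * P[i] @ J       # (TJ)(i); other components stay stale

print(np.max(np.abs(J - J_star)))        # sup-norm error shrinks toward 0
```

The design point: each component update uses whatever the other components currently hold, yet the contraction property alone forces convergence.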

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Applications of the Theorem



Asynchronous Convergence Mechanism

[Figure: the asynchronous convergence mechanism for J = (J1, J2) with TJ = (T1J, T2J) — nested boxes S(0) ⊃ S(k) ⊃ S(k + 1) around J∗, with J1 iterations and J2 iterations staying inside the current box.]

Once we are in S(k), iteration of any one component (say J1) keeps us in the current box.

Once all components have been iterated once, we pass into the next box S(k + 1) (permanently).
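The box mechanism above can be checked numerically on an assumed toy contraction (3 states, random data — not from the talk): each time every component has been updated at least once, the iterate has passed into the next box S(k + 1) = {J | ‖J − J∗‖∞ ≤ α^(k+1) B}:

```python
import numpy as np

# Toy contraction T J = g + alpha * P @ J with fixed point J*.
rng = np.random.default_rng(1)
n, alpha = 3, 0.8
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.random(n)
J_star = np.linalg.solve(np.eye(n) - alpha * P, g)

J = np.ones(n)
B = np.max(np.abs(J - J_star))           # chosen so that J starts in S(0)
updated, k, in_box = set(), 0, []
for t in range(5000):
    i = rng.integers(n)                  # asynchronous one-component update
    J[i] = g[i] + alpha * P[i] @ J
    updated.add(i)
    if len(updated) == n:                # every component iterated once:
        k += 1                           # ...we have passed into S(k+1)
        updated.clear()
        in_box.append(np.max(np.abs(J - J_star)) <= alpha**k * B + 1e-12)

print(all(in_box))                       # membership in S(k) never violated
```

The small tolerance only absorbs floating-point rounding; the containment itself follows from the contraction argument on the slides.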


Finding Fixed Points of Parametric Mappings

Problem: Find a fixed point J∗ of a mapping T : ℝⁿ → ℝⁿ of the form

(TJ)(i) = min µ∈Mᵢ (TµJ)(i), i = 1, . . . , n,

where µ is a parameter from some set Mᵢ.

Common idea (central in DP policy iteration): Instead of T , we iterate with some Tµ, then change µ once in a while.
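A minimal sketch of this idea on an assumed toy discounted MDP (3 states, 2 actions, random data — illustrative only): iterate m times with the current Tµ, then change µ by a minimization step µ(i) := argmin over u of (TuJ)(i). This is optimistic (modified) policy iteration, and here it drives the Bellman residual ‖J − TJ‖∞ to zero:

```python
import numpy as np

# Assumed toy MDP: P[u] is the transition matrix of action u, g[u] its cost.
rng = np.random.default_rng(2)
n, A, alpha, m = 3, 2, 0.9, 5
P = rng.random((A, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((A, n))

def T(J):                                 # (TJ)(i) = min_u [g_u + alpha P_u J](i)
    return np.min(g + alpha * P @ J, axis=0)

J = np.zeros(n)
for _ in range(300):
    Q = g + alpha * P @ J                 # one row of values per action
    mu = np.argmin(Q, axis=0)             # change mu: improvement step
    for _ in range(m):                    # then m iterations with T_mu
        J = (g + alpha * P @ J)[mu, np.arange(n)]

print(np.max(np.abs(J - T(J))))           # Bellman residual, tends to 0
```

With J0 = 0 and nonnegative costs we have J0 ≤ TJ0, a standard sufficient condition under which these iterates increase monotonically to J∗.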

[Figure: iterating with Tµ between µ-updates — starting from J0, steps J1 = TµJ0 move toward Jµ, and successive µ-updates generate Jµ0, Jµ1, Jµ2, Jµ3 approaching J∗ = Jµ∗.]


Distributed Asynchronous Parametric Fixed Point Iterations

Partition J and µ into components, J(1), . . . , J(n) and µ(1), . . . , µ(n)

Asynchronous Algorithm

Processor i, at each time, does one of three things:

Does nothing, or

Updates J(i) by setting it to

J(i) := (Tµ(i)J)(i)

or

Updates J(i) and µ(i) by setting

J(i) := (TJ)(i) = min_{µ∈Mi} (TµJ)(i)   and   µ(i) := arg min_{µ∈Mi} (TµJ)(i)

A very chaotic environment!
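The three choices above can be simulated directly. This toy sketch (same kind of made-up affine Tµ data as in a discounted problem; everything here is hypothetical) shows the mechanics of the chaotic update pattern, with no convergence claim attached:

```python
# At each tick a random coordinate i either does nothing, applies its current
# T_mu(i), or re-minimizes over mu. Mechanics only: convergence of this
# "natural" asynchronous algorithm is not guaranteed in general.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
g = {0: np.array([1.0, 2.0]), 1: np.array([2.0, 0.5])}
P = {0: np.array([[0.5, 0.5], [0.2, 0.8]]),
     1: np.array([[0.9, 0.1], [0.6, 0.4]])}

J = np.zeros(2)
mu = [0, 0]
for _ in range(2000):
    i = rng.integers(2)                  # processor i wakes up
    action = rng.integers(3)
    if action == 0:
        continue                         # does nothing
    elif action == 1:
        J[i] = g[mu[i]][i] + alpha * P[mu[i]][i] @ J   # J(i) := (T_mu(i) J)(i)
    else:
        vals = {m: g[m][i] + alpha * P[m][i] @ J for m in (0, 1)}
        mu[i] = min(vals, key=vals.get)  # mu(i) := argmin over mu
        J[i] = vals[mu[i]]               # J(i) := (TJ)(i)

print(np.round(J, 3))
```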



Difficulty with Parametric Fixed Point Algorithm

[Figure: fixed points Jµ, Jµ′, Jµ′′ of the mappings Tµ, with the generated sequence Jµ0, Jµ1, Jµ2, Jµ3 cycling back to Jµ0 instead of converging to J∗, the fixed point of T]

Distributed Asynchronous Computation of Fixed Points Classical Value and Policy Iteration for Discounted MDP Distributed Asynchronous Policy Iteration Interlocking Research Directions - Generalizations

A More Abstract View of What Follows

We want to find a fixed point J∗ of a mapping T : ℜ^n → ℜ^n of the form

(TJ)(i) = min_{µ∈Mi} (TµJ)(i),   i = 1, . . . , n,

where µ is a parameter from some set Mi.

We update J in two ways:

Iterate with T: J → TJ (cf. value iteration/DP), OR

Pick a µ and iterate with Tµ: J → TµJ (cf. policy evaluation/DP)

Difficulty: Tµ has a different fixed point than T ... so iterations with Tµ aim at a target other than J∗

Our key idea (abstractly): Embed both T and Tµ within another (uniform) contraction mapping Fµ that has the same fixed point for all µ

The uniform contraction mapping Fµ operates on the larger space of Q-factors

In the DP context, Fµ is associated with an optimal stopping problem

Most of what follows applies beyond DP

Tµ has a different fixed point than T ... so the target of the iterations keeps changing

A (restrictive) convergence result: If each Tµ is both a sup-norm contraction and monotone, convergence can be shown assuming that Tµ0 J0 ≤ J0

Without the condition Tµ0 J0 ≤ J0, the synchronous algorithm converges but the asynchronous algorithm may oscillate
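The restrictive condition can be checked numerically. With a single monotone affine contraction TµJ = g + αPJ (made-up data, not from the talk), starting from a J0 satisfying TµJ0 ≤ J0 yields componentwise monotonically decreasing iterates converging to the fixed point, which is the structure the convergence argument exploits:

```python
# T(J) = g + alpha*P@J is monotone (P >= 0) and a sup-norm contraction.
# Starting above the fixed point with T(J0) <= J0 gives monotone decrease.
import numpy as np

alpha = 0.9
g = np.array([1.0, 2.0])
P = np.array([[0.5, 0.5], [0.2, 0.8]])
T = lambda J: g + alpha * P @ J

J = np.full(2, g.max() / (1 - alpha))   # large enough that T(J) <= J
assert np.all(T(J) <= J)

monotone = True
for _ in range(300):
    Jn = T(J)
    monotone &= bool(np.all(Jn <= J + 1e-12))   # componentwise decrease
    J = Jn

Jstar = np.linalg.solve(np.eye(2) - alpha * P, g)   # exact fixed point of T
print(monotone, np.allclose(J, Jstar, atol=1e-6))
```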



Addressing the Difficulty: A High-Level View

Assume that each Tµ is a sup-norm contraction

Our approach: Embed both T and Tµ within another (uniform) contraction mapping Gµ that has the same fixed point for all µ

The uniform contraction mapping Gµ operates on a larger space

In the DP context, Gµ is associated with an optimal stopping problem

Most of what follows applies beyond DP



Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations

Dynamic Programming - Markovian Decision Problems (MDP)

System: Controlled Markov chain w/ transition probabilities pij(u)

States: i = 1, . . . , n

Controls: u ∈ U(i)

Cost per stage: g(i, u, j)

Stationary policy: State-to-control mapping µ; apply µ(i) when at state i

Discounted MDP: Find a policy µ that minimizes the expected value of the infinite horizon cost

∑_{k=0}^∞ α^k g(i_k, µ(i_k), i_{k+1})

where i_k is the state at time k, and α is the discount factor, 0 < α < 1
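As a concrete illustration of the model above (not part of the talk), the sketch below builds a small, randomly generated discounted MDP in NumPy: `p[u, i, j]` stores the transition probabilities pij(u), `g[u, i, j]` the stage costs g(i, u, j), and a stationary policy µ is just an array mapping states to controls. The specific sizes and the policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9            # 3 states, 2 controls, discount factor

# p[u, i, j] = p_ij(u): transition probabilities under control u
p = rng.random((m, n, n))
p /= p.sum(axis=2, keepdims=True)  # normalize each row into a distribution

# g[u, i, j] = g(i, u, j): cost of moving from i to j under control u
g = rng.random((m, n, n))

# A stationary policy mu: state -> control, apply mu[i] at state i
mu = np.array([0, 1, 0])

# Transition matrix and expected one-stage cost under mu
P_mu = p[mu, np.arange(n)]                       # row i is p_i.(mu(i))
g_mu = (P_mu * g[mu, np.arange(n)]).sum(axis=1)  # expected g(i, mu(i), j)

assert np.allclose(P_mu.sum(axis=1), 1.0)        # each row is a distribution
```

`P_mu` and `g_mu` are exactly the data that appear in Bellman's equation for a fixed policy, used on later slides.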

Major DP Results

J∗(i) = Optimal cost starting from state i

Jµ(i) = Cost starting from state i using policy µ

Bellman’s equation:

J∗(i) = min_{u∈U(i)} ∑_{j=1}^n pij(u)( g(i, u, j) + αJ∗(j) ), i = 1, . . . , n

A parametric fixed point equation.

J∗ is the unique fixed point.

An optimal policy attains the minimum for each i on the RHS of Bellman’s equation.

Bellman’s equation for a policy µ:

Jµ(i) = ∑_{j=1}^n pij(µ(i))( g(i, µ(i), j) + αJµ(j) ), i = 1, . . . , n

It is a linear system of equations with Jµ as its unique solution.
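Because Bellman's equation for a fixed policy µ is linear, Jµ can be computed directly as Jµ = (I − αPµ)⁻¹gµ. A minimal sketch on a randomly generated 3-state MDP (the setup is an illustrative assumption, not data from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))

mu = np.array([0, 1, 0])                          # a fixed stationary policy
P_mu = p[mu, np.arange(n)]                        # P_mu[i, j] = p_ij(mu(i))
g_mu = (P_mu * g[mu, np.arange(n)]).sum(axis=1)   # expected one-stage cost

# Bellman's equation for mu, J_mu = g_mu + alpha * P_mu @ J_mu, is linear:
J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

assert np.allclose(J_mu, g_mu + alpha * P_mu @ J_mu)  # J_mu solves it
```

Since 0 < α < 1 and Pµ is a stochastic matrix, I − αPµ is invertible, so the solution is unique, matching the slide's claim.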

Shorthand Notation - Fixed Point View

Denote by T and Tµ the mappings that map J ∈ ℝⁿ to the vectors TJ and TµJ with components

(TJ)(i) := min_{u∈U(i)} ∑_{j=1}^n pij(u)( g(i, u, j) + αJ(j) ), i = 1, . . . , n,

and

(TµJ)(i) := ∑_{j=1}^n pij(µ(i))( g(i, µ(i), j) + αJ(j) ), i = 1, . . . , n

Bellman’s equations are written as

J∗ = TJ∗ = min_µ TµJ∗,  Jµ = TµJµ

Key structure: T and Tµ are sup-norm contractions with modulus α:

‖TJ − TJ′‖∞ = max_{i=1,...,n} |(TJ)(i) − (TJ′)(i)| ≤ α max_{i=1,...,n} |J(i) − J′(i)| = α‖J − J′‖∞
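The operators T and Tµ are a few lines of NumPy, and the contraction inequality can be checked numerically. A sketch on a randomly generated 3-state MDP (illustrative setup, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))

def T(J):
    """(TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j))."""
    return (p * (g + alpha * J)).sum(axis=2).min(axis=0)

def T_mu(J, mu):
    """(T_mu J)(i) = sum_j p_ij(mu(i)) (g(i,mu(i),j) + alpha J(j))."""
    return (p * (g + alpha * J)).sum(axis=2)[mu, np.arange(n)]

# Sup-norm contraction with modulus alpha, for T and for any T_mu
J1, J2 = rng.standard_normal(n), rng.standard_normal(n)
mu = rng.integers(0, m, size=n)
assert np.abs(T(J1) - T(J2)).max() <= alpha * np.abs(J1 - J2).max() + 1e-12
assert np.abs(T_mu(J1, mu) - T_mu(J2, mu)).max() <= alpha * np.abs(J1 - J2).max() + 1e-12
```

The inequality holds for every pair J, J′, not just this random one; the check merely illustrates the slide's ‖TJ − TJ′‖∞ ≤ α‖J − J′‖∞ bound.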

Major (Synchronous) Methods for Finding Fixed Point of T

Value iteration (generic fixed point method): Start with any J0, iterate by

J^{t+1} = TJ^t

Policy iteration (special method for T of the form T = min_µ Tµ): Start with any J0 and µ0. Given J^t and µ^t, iterate by:

Policy evaluation: J^{t+1} = (Tµt)^m J^t (m applications of Tµt to J^t; m = ∞ is possible)
Policy improvement: µ^{t+1} attains the min in TJ^{t+1} (i.e., Tµt+1 J^{t+1} = TJ^{t+1})

Both methods converge to J∗:

Value iteration, thanks to the contraction property of T
Policy iteration, thanks to the contraction and monotonicity of T and Tµ

Typically, policy iteration (with a reasonable choice of m) is more efficient because an application of Tµ is cheaper than an application of T
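Both methods can be run side by side on the same randomly generated 3-state MDP and checked against each other. The sketch below uses exact policy evaluation (the m = ∞ case) via the linear solve; the setup is an illustrative assumption, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))

def T(J):
    return (p * (g + alpha * J)).sum(axis=2).min(axis=0)

# Value iteration: J^{t+1} = T J^t
J_vi = np.zeros(n)
for _ in range(2000):
    J_vi = T(J_vi)

# Policy iteration with exact evaluation (m = infinity)
mu = np.zeros(n, dtype=int)
for _ in range(20):
    P_mu = p[mu, np.arange(n)]
    g_mu = (P_mu * g[mu, np.arange(n)]).sum(axis=1)
    J_pi = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)    # evaluation
    mu = (p * (g + alpha * J_pi)).sum(axis=2).argmin(axis=0)  # improvement

assert np.allclose(J_vi, J_pi, atol=1e-6)  # both reach the fixed point J*
```

With finitely many policies, exact policy iteration terminates after a finite number of improvements, while value iteration needs on the order of log(1/ε)/log(1/α) sweeps; this is the efficiency trade-off the slide alludes to.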

Graphical Interpretation of Policy Iteration

[Figure: one-dimensional illustration. Value iteration generates J^{t+1} = TJ^t as a staircase construction between the graph of TJ = min_µ TµJ and the 45-degree line, converging to the fixed point J∗ = TJ∗. Policy iteration alternates policy improvement with exact or approximate policy evaluation: improvement at J1 yields µ1, and evaluation follows Tµ1 toward Jµ1 = Tµ1Jµ1.]

Failure of Asynchronous Policy Iteration: W-B Example

Counterexample by Williams and Baird (1993)

Deterministic discounted MDP with 6 states arranged in a circle

2 controls available in half the states, 1 control available in the other half

Policy evaluations and improvements are one state at a time, no “delays”

A cycle of 15 iterations is constructed that repeats the initial conditions


Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations

Rectifying the Difficulty - Q-Factors

Q-factors Q(i, u) are functions of state-control pairs (i, u)

The optimal Q-factors are given by

Q∗(i, u) = ∑_{j=1}^n pij(u)( g(i, u, j) + αJ∗(j) )

Q∗(i, u): cost of starting at i, using u at the first stage, and following an optimal policy thereafter.

Bellman’s equation J∗ = TJ∗ is written as

J∗(i) = min_{u∈U(i)} Q∗(i, u)

These two equations constitute a fixed point equation in (Q, J)

For a fixed policy µ, the corresponding fixed point equation in (Qµ, Jµ) is

Qµ(i, u) = ∑_{j=1}^n pij(u)( g(i, u, j) + αQµ(j, µ(j)) )

Jµ(i) = Qµ(i, µ(i))
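The Q-factor form of Bellman's equation is easy to verify numerically: compute J∗ by value iteration, form Q∗ from it, and check J∗(i) = min_u Q∗(i, u). A sketch on a randomly generated 3-state MDP (illustrative setup, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))

# J* by value iteration
J = np.zeros(n)
for _ in range(2000):
    J = (p * (g + alpha * J)).sum(axis=2).min(axis=0)

# Q*[u, i] = sum_j p_ij(u) (g(i,u,j) + alpha J*(j))
Q = (p * (g + alpha * J)).sum(axis=2)

# Bellman's equation in Q-factor form: J*(i) = min_u Q*(i, u)
assert np.allclose(J, Q.min(axis=0), atol=1e-8)
```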

Sup-Norm Uniform Contraction Property

Consider Q-factors Q(i, u) and costs J(i). For any µ, define the mapping

(Q, J) ↦ ( Fµ(Q, J), Mµ(Q, J) )

where

Fµ(Q, J)(i, u) := ∑_{j=1}^n pij(u)( g(i, u, j) + α min{ J(j), Q(j, µ(j)) } ),

Mµ(Q, J)(i) := min_{u∈U(i)} Fµ(Q, J)(i, u)

Key fact: a sup-norm contraction with common fixed point (Q∗, J∗) for all µ:

max{ ‖Fµ(Q, J) − Q∗‖∞, ‖Mµ(Q, J) − J∗‖∞ } ≤ α max{ ‖Q − Q∗‖∞, ‖J − J∗‖∞ }

The asynchronous iteration for the fixed point problem

Q := Fµ(Q, J),  J := Mµ(Q, J),  µ := arbitrary

is convergent.

Even though we operate with different mappings Fµ and Mµ, they all have a common fixed point.
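The mappings Fµ and Mµ, and the uniform contraction inequality around (Q∗, J∗), can be checked directly for an arbitrary policy µ. A sketch on a randomly generated 3-state MDP (illustrative setup, not from the talk; Q is stored as an array indexed `[u, i]`):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))

def F(Q, J, mu):
    """F_mu(Q,J)(i,u) = sum_j p_ij(u)(g(i,u,j) + alpha*min{J(j), Q(j,mu(j))})."""
    w = np.minimum(J, Q[mu, np.arange(n)])
    return (p * (g + alpha * w)).sum(axis=2)      # indexed [u, i]

def M(Q, J, mu):
    """M_mu(Q,J)(i) = min over u of F_mu(Q,J)(i,u)."""
    return F(Q, J, mu).min(axis=0)

# Common fixed point (Q*, J*): J* by value iteration, Q* from J*
Jstar = np.zeros(n)
for _ in range(2000):
    Jstar = (p * (g + alpha * Jstar)).sum(axis=2).min(axis=0)
Qstar = (p * (g + alpha * Jstar)).sum(axis=2)

# Uniform contraction inequality for an arbitrary mu and arbitrary (Q, J)
mu = rng.integers(0, m, size=n)
Q0, J0 = rng.standard_normal((m, n)), rng.standard_normal(n)
lhs = max(np.abs(F(Q0, J0, mu) - Qstar).max(), np.abs(M(Q0, J0, mu) - Jstar).max())
rhs = alpha * max(np.abs(Q0 - Qstar).max(), np.abs(J0 - Jstar).max())
assert lhs <= rhs + 1e-8

# (Q*, J*) is a fixed point of (F_mu, M_mu) for this (indeed any) mu
assert np.allclose(F(Qstar, Jstar, mu), Qstar, atol=1e-6)
assert np.allclose(M(Qstar, Jstar, mu), Jstar, atol=1e-6)
```

The min{J(j), Q(j, µ(j))} term is what makes the fixed point independent of µ: at (Q∗, J∗) it always evaluates to J∗(j), since J∗(j) = min_u Q∗(j, u).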

Asynchronous Policy Iteration: A More Detailed View

Asynchronous distributed policy iteration algorithm: Maintains J t , µt , and V t ,where

V t (i) = Qt(i, µt (i))

≈ Q-factors of current policy

Processor i at time t does one of three things:

Does nothingOr operates with Fµt at i :

V t+1(i) =n∑

j=1

pij (µt (i))

(g(i, µt (i), j) + αmin

{J t (j),V t (j)}

)and leaves J t (i), µt (i) unchanged.Or operates with Mµt at i : Set

J t+1(i) = V t+1(i) = minu∈U(i)

n∑j=1

pij (u)(g(i, u, j) + αmin

{J t (j),V t (j)}

)and sets µt+1(i) to a control that attains the minimum.

Convergence follows by the asynchronous convergence theorem.


Williams and Baird Counterexample


Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations


Generalized Mappings T and Tµ

The preceding analysis uses only the contraction property of the discounted MDP (not monotonicity or the probabilistic structure).

Abstract mappings T and Tµ: Introduce a mapping H(i, u, J) and denote

(TJ)(i) = min_{u ∈ U(i)} H(i, u, J),   (TµJ)(i) = H(i, µ(i), J),

i.e., TJ = minµ TµJ, where the min is taken separately for each component.

Assume that for all i and u ∈ U(i),

|H(i, u, J) − H(i, u, J′)| ≤ α ‖J − J′‖∞

Asynchronous algorithm: At time t, processor i does one of three things:

Does nothing.

Or does a "policy evaluation" at i: Sets

V^{t+1}(i) = H(i, µ^t(i), min{J^t, V^t})

and leaves J^t(i), µ^t(i) unchanged.

Or does a "policy improvement" at i: Sets

J^{t+1}(i) = V^{t+1}(i) = min_{u ∈ U(i)} H(i, u, min{J^t, V^t})

and sets µ^{t+1}(i) to a u that attains the minimum.
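To illustrate the abstraction (an example constructed here, not taken from the talk), H can encode a discounted sequential minimax problem: after the minimizer's choice u, a hypothetical opponent picks w to maximize the cost. This H satisfies the sup-norm contraction assumption with modulus α, and the same asynchronous scheme applies verbatim, with no probabilistic interpretation of min{J, V} needed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, w_ct, alpha = 4, 3, 2, 0.8   # states, min-player controls, max-player moves

# P[u, w, i, j]: transition probs given both choices; g[i, u, w]: stage cost
P = rng.random((m, w_ct, n, n))
P /= P.sum(axis=3, keepdims=True)
g = rng.random((n, m, w_ct))

def H(i, u, J):
    # Minimax H: opponent maximizes over w.
    # |H(i, u, J) - H(i, u, J')| <= alpha * ||J - J'||_inf, as required.
    return max(g[i, u, w] + alpha * (P[u, w, i] @ J) for w in range(w_ct))

J, V = np.zeros(n), np.zeros(n)
mu = np.zeros(n, dtype=int)
for t in range(8000):
    i = rng.integers(n)
    W = np.minimum(J, V)                     # min{J^t, V^t}
    if rng.random() < 0.5:                   # "policy evaluation" at i
        V[i] = H(i, mu[i], W)
    else:                                    # "policy improvement" at i
        vals = [H(i, u, W) for u in range(m)]
        mu[i] = int(np.argmin(vals))
        J[i] = V[i] = vals[mu[i]]

# J should solve the abstract Bellman equation J = TJ
TJ = np.array([min(H(i, u, J) for u in range(m)) for i in range(n)])
assert np.allclose(J, TJ, atol=1e-6)
```

Only the contraction inequality on H is used: swapping in a different H (semi-Markov, shortest path with suitable assumptions, and so on) leaves the algorithm unchanged.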


DP Applications with Generalized T and Tµ

DP models beyond discounted, with standard policy evaluation:

Optimistic/modified policy iteration for semi-Markov and minimax discounted problems
Stochastic shortest path problems
Semi-Markov decision problems
Stochastic zero-sum games
Sequential minimax problems

Multi-agent aggregation:

Each agent updates costs at all states within a subset
Each agent uses detailed costs for the local states and aggregate costs for other states, as communicated by other agents


Fixed Points of Concave Sup-Norm Contractions

[Figure: geometric view of policy iteration for a concave sup-norm contraction, showing the graphs of J, TJ, and the linearizations TµJ; the iterates J0, J1 = TµJ0, Jµ0, Jµ1, Jµ2, Jµ3; the µ-updates; and the fixed point J∗ = TJ∗ = Jµ∗]

Find a fixed point of a mapping T : ℝ^n → ℝ^n, where the components Ti(·) : ℝ^n → ℝ are concave sup-norm contractions

T is the minimum of a collection of linear mappings Tµ involving the concave conjugate functions of the concave functions Ti

An asynchronous parametric fixed point algorithm can be used:
Linearize at the current J
Iterate using the linearized mapping
Update the linearization

All of the above are done asynchronously
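As a concrete instance (a hypothetical example, chosen here for illustration), take Ti(J) = c_i + α min_j (A_ij + J_j): each component is a minimum of affine functions of J, hence concave, and T is a sup-norm contraction with modulus α. Linearizing at the current J means fixing, for each i, an index µ(i) attaining the min, which yields the linear mapping Tµ; linearization, iteration with the linearized mapping, and re-linearization can then be interleaved asynchronously exactly as in the policy iteration algorithm above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 5, 0.9
c, A = rng.random(n), rng.random((n, n))   # hypothetical problem data

def T(J):
    # Componentwise-concave sup-norm contraction:
    # T_i(J) = c_i + alpha * min_j (A_ij + J_j)   (a min of affine maps)
    return c + alpha * (A + J).min(axis=1)

J, V = np.zeros(n), np.zeros(n)
mu = np.zeros(n, dtype=int)                # mu(i): active linear piece at i
for t in range(6000):
    i = rng.integers(n)
    W = np.minimum(J, V)                   # min{J^t, V^t}
    if rng.random() < 0.5:
        # iterate with the current linearization T_mu (a linear mapping) at i
        V[i] = c[i] + alpha * (A[i, mu[i]] + W[mu[i]])
    else:
        # re-linearize at i: pick the affine piece attaining the minimum
        mu[i] = int(np.argmin(A[i] + W))
        J[i] = V[i] = c[i] + alpha * (A[i, mu[i]] + W[mu[i]])

assert np.allclose(J, T(J), atol=1e-6)     # J converges to the fixed point of T
```

Here the "policies" µ are nothing more than selections of linear pieces, which is the sense in which the DP machinery extends to non-DP fixed point problems of the form T = minµ Tµ.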


Concluding Remarks

Asynchronous policy iteration has fragile convergence properties

We have provided a new approach to correct the difficulties

Key idea: Embed the problem into one that involves a uniform sup-norm contraction

Extensions to generalized DP models involving:
Other DP models (e.g., stochastic shortest path, games, etc.)
Fixed point problems for non-DP mappings of the form T = min_{µ∈M} Tµ, such as concave sup-norm contractions


Thank You!