
Transcript of: Distributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming (…dimitrib/Asynchronous_Plen_Talk.pdf)

Page 1


Distributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming

Dimitri P. Bertsekas

Laboratory for Information and Decision Systems
Massachusetts Institute of Technology

PCO ’13, Cambridge, MA

May 2013

Joint Work with Huizhen (Janey) Yu

Pages 2–9


Summary

Background Review

Asynchronous iterative fixed point methods

Convergence issues

Dynamic programming (DP) applications

New Research on Asynchronous DP Algorithms

Asynchronous policy iteration (asynchronous value and policy updates by multiple processors)

Failure of the “natural” algorithm (Williams-Baird counterexample, 1993)

A radical modification of policy iteration/evaluation: Aim to solve an optimal stopping problem instead of solving a linear system

Convergence properties are restored/enhanced

Generalizations and abstractions


Pages 10–11


References

Current work

D. P. Bertsekas and H. Yu, “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming,” Math. of OR, 2012

H. Yu and D. P. Bertsekas, “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems,” to appear in Annals of OR

D. P. Bertsekas and H. Yu, “Distributed Asynchronous Policy Iteration,” Proc. Allerton Conference, Sept. 2010

More works in progress

Connection to early work on theory of totally asynchronous algorithms

D. P. Bertsekas, “Distributed Dynamic Programming,” IEEE Transactions on Aut. Control, Vol. AC-27, 1982

D. P. Bertsekas, “Distributed Asynchronous Computation of Fixed Points,” Mathematical Programming, Vol. 27, 1983

D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989


Pages 12–13


Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations


Pages 14–18


Some History: The First “Real” Implementation of a Distributed Asynchronous Iterative Algorithm

Routing algorithm of the ARPANET (1969)

Based on the Bellman-Ford shortest path algorithm

J(i) = \min_{j=0,1,\ldots,n} \bigl\{ a_{ij} + J(j) \bigr\}, \qquad i = 1, \ldots, n,

where J(i) is the estimated distance of node i to destination node 0 (with J(0) = 0), and a_{ij} is the length of the link connecting i to j (a small code sketch follows the bullets below)

Original ambitious implementation included congestion-dependent a_{ij}

Failed miserably because of violent oscillations

The algorithm was “stabilized” by making a_{ij} essentially constant (1974)

Showing convergence of this algorithm and its DP extensions was my entry point into the field (1979)
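A minimal Python sketch of this update (not the ARPANET implementation; the graph, node count, and update schedule below are illustrative assumptions), with each node i repeatedly re-estimating J(i) from its neighbors' current values:

```python
import random

def async_bellman_ford(a, n, sweeps=1000, seed=0):
    """Asynchronous Bellman-Ford toward destination node 0.

    a[i][j] is the length a_ij of link (i, j); J[0] stays fixed at 0.
    Nodes are updated one at a time in random order, mimicking
    uncoordinated updates by separate routers.
    """
    rng = random.Random(seed)
    J = [0.0] + [float("inf")] * n        # J(0) = 0, other estimates start at +infinity
    for _ in range(sweeps):
        i = rng.randint(1, n)             # some node i wakes up and updates
        # J(i) <- min_j { a_ij + J(j) } over the outgoing links of node i
        J[i] = min(a[i][j] + J[j] for j in a[i])
    return J

# Tiny example: node 0 is the destination; nodes 1 and 2 route toward it.
links = {1: {0: 4.0, 2: 1.0}, 2: {0: 2.0, 1: 1.0}}
print(async_bellman_ford(links, n=2))     # converges to [0.0, 3.0, 2.0]
```

With the link lengths held fixed, the estimates settle at the shortest distances regardless of the update order, which is exactly the property that breaks down when a_{ij} is made congestion-dependent.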


Pages 19–25


Early Work on Asynchronous Iterative Algorithms

Initial proposal by Rosenfeld (1967) involved experimentation but no convergence analysis

Sup-norm linear contractive iterations by Chazan and Miranker (1969)

Sup-norm nonlinear contractive iterations by Miellou (1975), Robert (1976), Baudet (1978)

DP algorithms (Bertsekas 1982, 1983) - both sup-norm contractive and also noncontractive/monotone iterations (e.g., shortest path)

Many subsequent works (nonexpansive, network, and other iterations)

Types of Asynchronism

All asynchronous iterative algorithms proposed up to the early 80s involved “total asynchronism” and either sup-norm contractive or monotone mappings

Subsequent work also involved “partial asynchronism,” which could be applied to additional types of iterations (e.g., gradient methods for optimization)


Pages 26–30


Distributed “Totally” Asynchronous Framework for Fixed Point Computation

Consider solution of the general fixed point problem J = TJ, or componentwise,

J(i) = T_i\bigl(J(1), \ldots, J(n)\bigr), \qquad i = 1, \ldots, n

Network of processors, with a separate processor i for each component J(i), i = 1, . . . , n

Asynchronous Fixed Point Algorithm

Processor i updates J(i) at a subset of times 𝒯_i ⊂ {0, 1, . . .}

Processor i receives (possibly outdated) values J(j) from other processors j ≠ i

Update of processor i [with “delays” t − τ_{ij}(t)]:

J^{t+1}(i) =
\begin{cases}
T_i\bigl(J^{\tau_{i1}(t)}(1), \ldots, J^{\tau_{in}(t)}(n)\bigr) & \text{if } t \in \mathcal{T}_i, \\
J^t(i) & \text{if } t \notin \mathcal{T}_i.
\end{cases}
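A minimal single-process simulation of this scheme (a sketch with illustrative names; the component maps T_components, the delay bound, and the example below are assumptions, with T chosen as a sup-norm contraction so the iterates settle at its unique fixed point):

```python
import random

def async_fixed_point(T_components, n, steps=5000, max_delay=5, seed=0):
    """Simulate J^{t+1}(i) = T_i(J^{tau_i1(t)}(1), ..., J^{tau_in(t)}(n)).

    T_components[i](J) returns component i of T applied to a full vector J.
    A history of past iterates lets each update read values that are up to
    max_delay steps out of date, emulating the delays t - tau_ij(t).
    """
    rng = random.Random(seed)
    J = [0.0] * n
    history = [list(J)]
    for t in range(steps):
        i = rng.randrange(n)                  # processor i updates at time t
        # possibly outdated view of the other components
        view = [history[max(0, t - rng.randint(0, max_delay))][j] for j in range(n)]
        J[i] = T_components[i](view)
        history.append(list(J))
    return J

# Example: T(J) = A J + b with row sums of |A| below 1, a sup-norm contraction.
A = [[0.2, 0.3], [0.4, 0.1]]
b = [1.0, 2.0]
T = [lambda J, i=i: sum(A[i][j] * J[j] for j in range(2)) + b[i] for i in range(2)]
print(async_fixed_point(T, n=2))              # approaches the solution of J = A J + b
```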


Pages 31–34


Distributed Convergence of Fixed Point Iterations

A general theorem for “totally asynchronous” iterations, i.e., the sets 𝒯_i are infinite and τ_{ij}(t) → ∞ as t → ∞ (Bertsekas, 1983)

[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k + 1) shrinking to the fixed point J∗, with each S(k) shown as a product of component boxes S_1(·) × S_2(·).]

Assume there is a nested sequence of sets S(k + 1) ⊂ S(k) such that:

(Synchronous Convergence Condition) We have

TJ ∈ S(k + 1), ∀ J ∈ S(k),

and the limit points of all sequences {J_k} with J_k ∈ S(k) for all k are fixed points of T.

(Box Condition) S(k) is a Cartesian product:

S(k) = S_1(k) × · · · × S_n(k)

Then, if J^0 ∈ S(0), every limit point of {J^t} is a fixed point of T.
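A standard instance (not spelled out on the slide) is a sup-norm contraction T with modulus α ∈ (0, 1) and fixed point J∗, e.g., the Bellman operator of an α-discounted MDP; sup-norm balls around J∗ supply the nested boxes:

S(k) = \bigl\{ J : \|J - J^*\|_\infty \le \alpha^k \|J^0 - J^*\|_\infty \bigr\} = S_1(k) \times \cdots \times S_n(k), \qquad S_i(k) = \bigl\{ J(i) : |J(i) - J^*(i)| \le \alpha^k \|J^0 - J^*\|_\infty \bigr\}.

The contraction property gives ‖TJ − J∗‖∞ ≤ α‖J − J∗‖∞, so TJ ∈ S(k + 1) whenever J ∈ S(k) (Synchronous Convergence Condition); each S(k) is a Cartesian product of intervals (Box Condition); and since the intersection of the S(k) is {J∗}, the asynchronous iterates converge to J∗.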


Page 35


Applications of the Theorem


x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}

r0 + D−1Ra(C)

Subspace S = {Φr | r ∈ �s} Set S 0

Limit of {rk}x = Φr

1

S1(0) S2(0)

N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0

x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}

r0 + D−1Ra(C)

Subspace S = {Φr | r ∈ �s} Set S 0

Limit of {rk}x = Φr

1

Major contexts where the theorem applies:

T is a sup-norm contraction with fixed point J∗ and modulus α:

S(k) = {J | ‖J − J∗‖∞ ≤ α^k B}, for some scalar B

T is monotone (TJ ≤ TJ′ for J ≤ J′) with fixed point J∗. Moreover,

S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄}

for some J̲ and J̄ with J̲ ≤ T J̲ ≤ T J̄ ≤ J̄, and lim_{k→∞} T^k J̲ = lim_{k→∞} T^k J̄ = J∗.

Both of these apply to various DP problems: the 1st context applies to discounted problems; the 2nd context applies to undiscounted problems (e.g., shortest paths)
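As a quick check of the first context, the sketch below (a small randomly generated discounted MDP, my own example rather than the talk's) verifies numerically that the Bellman operator T is a sup-norm contraction of modulus α, so the sets S(k) = {J | ‖J − J∗‖∞ ≤ α^k B} satisfy both the Synchronous Convergence Condition and the Box Condition.

```python
import numpy as np

# Sketch (toy discounted MDP, not from the talk): check the contraction property
# ||TJ - TJ'||_inf <= alpha * ||J - J'||_inf that underlies the first context.
rng = np.random.default_rng(1)
n, m, alpha = 4, 3, 0.9                              # states, controls, discount factor
P = rng.uniform(size=(m, n, n))
P /= P.sum(axis=2, keepdims=True)                    # P[u, i, j] = p_ij(u)
g = rng.uniform(size=(m, n))                         # g[u, i] = expected one-stage cost

def T(J):
    # (TJ)(i) = min_u [ g(i, u) + alpha * sum_j p_ij(u) J(j) ]
    return np.min(g + alpha * (P @ J), axis=0)

J1, J2 = rng.normal(size=n), rng.normal(size=n)
lhs = np.max(np.abs(T(J1) - T(J2)))
rhs = alpha * np.max(np.abs(J1 - J2))
print(lhs <= rhs + 1e-12)                            # True: T is an alpha-contraction
```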


Page 40: Asynchronous Convergence Mechanism

[Figure: the box S(k) = S1(k) × S2(k) in the space of J = (J1, J2), with TJ = (T1J, T2J); J1 iterations and a J2 iteration each stay within the current box while the boxes shrink toward J∗.]

Once we are in S(k), iteration of any one component (say J1) keeps us in the current box

Once all components have been iterated once, we pass into the next box S(k + 1) (permanently)
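The sketch below (again a toy contraction of my own, not from the talk) tracks the largest index k with J ∈ S(k) = {J | ‖J − J∗‖∞ ≤ α^k B} along a run of single-component updates; the index never decreases, illustrating why entry into S(k + 1) is permanent.

```python
import numpy as np

# Sketch (toy example, not from the talk) of the box mechanism: track the largest
# k with J in S(k) = {J : ||J - J*||_inf <= alpha^k * B} during asynchronous
# single-component updates of a sup-norm contraction TJ = A J + b.
rng = np.random.default_rng(2)
n, alpha = 4, 0.8
A = rng.uniform(size=(n, n))
A *= alpha / A.sum(axis=1, keepdims=True)            # contraction modulus alpha
b = rng.uniform(size=n)
J_star = np.linalg.solve(np.eye(n) - A, b)

J = np.ones(n)
B = np.max(np.abs(J - J_star))                       # chosen so that J0 lies in S(0)

def box_index(J):
    err = np.max(np.abs(J - J_star))
    return int(np.floor(np.log(err / B + 1e-300) / np.log(alpha)))

ks = []
for t in range(200):
    i = rng.integers(n)                              # update one component only
    J[i] = A[i] @ J + b[i]
    ks.append(box_index(J))
print(all(k1 <= k2 for k1, k2 in zip(ks, ks[1:])), ks[0], ks[-1])
# True, followed by the first and last box indices: the iterate never leaves its
# current box and keeps entering deeper ones as components get refreshed
```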


Page 42: Finding Fixed Points of Parametric Mappings

Problem: Find a fixed point J∗ of a mapping T : ℜ^n → ℜ^n of the form

(TJ)(i) = min_{µ∈Mi} (TµJ)(i),   i = 1, . . . , n,

where µ is a parameter from some set Mi.

Common idea (central in DP policy iteration): Instead of T, we iterate with some Tµ, then change µ once in a while
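The sketch below illustrates this idea on a small randomly generated discounted MDP (my own toy example, not from the talk): J is iterated a few times with the current Tµ, µ is then re-optimized, and the combination still converges to J∗.

```python
import numpy as np

# Sketch (toy discounted MDP, not from the talk): iterate mostly with T_mu and
# change mu "once in a while" (optimistic / modified policy iteration).
rng = np.random.default_rng(3)
n, nu, alpha = 5, 3, 0.9
P = rng.uniform(size=(nu, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.uniform(size=(nu, n))

def T(J):                      # (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
    return np.min(g + alpha * (P @ J), axis=0)

def T_mu(J, mu):               # (T_mu J)(i) = g(i,mu(i)) + alpha * sum_j p_ij(mu(i)) J(j)
    idx = np.arange(n)
    return g[mu, idx] + alpha * (P[mu, idx] @ J)

# fixed point of T, for reference (plain value iteration)
J_star = np.zeros(n)
for _ in range(2000):
    J_star = T(J_star)

J, mu = np.zeros(n), np.zeros(n, dtype=int)
for k in range(200):
    mu = np.argmin(g + alpha * (P @ J), axis=0)      # mu-update (policy improvement)
    for _ in range(5):                               # a few iterations with T_mu only
        J = T_mu(J, mu)
print(np.max(np.abs(J - J_star)))                    # close to 0
```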

[Figure: iterating with Tµ from J0 drives J toward Jµ, the fixed point of Tµ, rather than toward J∗ = TJ∗; µ-updates redirect the iteration, producing successive targets Jµ0, Jµ1, Jµ2, Jµ3 with J∗ = Jµ∗.]


Page 44: Distributed Asynchronous Parametric Fixed Point Iterations

Partition J and µ into components, J(1), . . . , J(n) and µ(1), . . . , µ(n)

Asynchronous Algorithm

Processor i, at each time, does one of three things:

Does nothing, or

Updates J(i) by setting it to

J(i) := (Tµ(i)J)(i)

or updates J(i) and µ(i) by setting

J(i) := (TJ)(i) = min_{µ∈Mi} (TµJ)(i)

and

µ(i) := arg min_{µ∈Mi} (TµJ)(i)

A very chaotic environment!
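The following is a rough simulation of this environment on a toy discounted MDP (my own illustration; the update rules are coded as I read them from this slide). At each tick a randomly chosen processor idles, applies a Tµ(i)-update to J(i), or updates both J(i) and µ(i). For this particular random instance the iterates end up close to J∗, but, as the next slides discuss, such convergence is not guaranteed for the "natural" algorithm in general.

```python
import numpy as np

# Rough sketch (toy MDP, my own illustration, not from the talk) of the asynchronous
# parametric iteration: at each tick, a randomly chosen processor i either idles,
# updates J(i) with the current parameter mu(i), or updates both J(i) and mu(i).
rng = np.random.default_rng(4)
n, nu, alpha = 5, 3, 0.9
P = rng.uniform(size=(nu, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.uniform(size=(nu, n))

def q_values(J, i):
    # q[u] = (T_u J)(i) = g(i,u) + alpha * sum_j p_ij(u) J(j)
    return g[:, i] + alpha * (P[:, i, :] @ J)

J = rng.normal(size=n)
mu = rng.integers(nu, size=n)
for t in range(5000):
    i = rng.integers(n)                   # processor i wakes up
    action = rng.integers(3)
    if action == 0:
        pass                              # does nothing
    elif action == 1:
        J[i] = q_values(J, i)[mu[i]]      # J(i) := (T_mu(i) J)(i)
    else:
        q = q_values(J, i)
        J[i] = q.min()                    # J(i) := (TJ)(i) = min_mu (T_mu J)(i)
        mu[i] = int(q.argmin())           # mu(i) := argmin_mu (T_mu J)(i)

# Compare with the fixed point of T obtained by plain value iteration.
J_star = np.zeros(n)
for _ in range(2000):
    J_star = np.min(g + alpha * (P @ J_star), axis=0)
print(np.max(np.abs(J - J_star)))         # small for this toy instance; convergence
                                          # is NOT guaranteed in general, which is
                                          # exactly the difficulty discussed next
```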


Page 46: Difficulty with Parametric Fixed Point Algorithm

[Figure: fixed points Jµ, Jµ′, Jµ′′ of different mappings Tµ and the successive targets Jµ0, Jµ1, Jµ2, Jµ3, contrasted with J∗, the fixed point of T.]

Distributed Asynchronous Computation of Fixed Points Classical Value and Policy Iteration for Discounted MDP Distributed Asynchronous Policy Iteration Interlocking Research Directions - Generalizations

A More Abstract View of What Follows

We want to find a fixed point J∗ of a mapping T : ℜ^n → ℜ^n of the form

(TJ)(i) = min_{µ∈M_i} (TµJ)(i),    i = 1, . . . , n,

where µ is a parameter from some set M_i.

We update J in two ways:

Iterate with T : J → TJ (cf. value iteration/DP), OR pick a µ and iterate with Tµ : J → TµJ (cf. policy evaluation/DP)

Difficulty: Tµ has a different fixed point than T ... so iterations with Tµ aim at a target other than J∗

Our key idea (abstractly): Embed both T and Tµ within another (uniform) contraction mapping Fµ that has the same fixed point for all µ

The uniform contraction mapping Fµ operates on the larger space of Q-factors

In the DP context, Fµ is associated with an optimal stopping problem

Most of what follows applies beyond DP

Tµ has a different fixed point than T ... so the target of the iterations keeps changing

A (restrictive) convergence result: If each Tµ is both a sup-norm contraction and monotone, convergence can be shown assuming that Tµ0 J0 ≤ J0

Without the condition Tµ0 J0 ≤ J0, the synchronous algorithm converges but the asynchronous algorithm may oscillate
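As a toy illustration of this abstract setup (a minimal sketch with hypothetical data, not from the talk): two affine sup-norm contractions Tµ J = gµ + α Pµ J, with T taking the componentwise minimum over µ. Each Tµ has its own fixed point, which differs from the fixed point J∗ of T.

```python
import numpy as np

# Hypothetical data: two sup-norm contractions T_mu J = g_mu + alpha * P_mu J on R^2,
# with (TJ)(i) = min over mu of (T_mu J)(i), taken componentwise.
alpha = 0.9
P = {0: np.eye(2), 1: np.array([[0.0, 1.0], [1.0, 0.0]])}   # row-stochastic matrices
g = {0: np.array([1.0, 2.0]), 1: np.array([2.0, 0.5])}

def T_mu(J, mu):
    return g[mu] + alpha * P[mu] @ J

def T(J):
    return np.minimum(T_mu(J, 0), T_mu(J, 1))

# Each T_mu has its own fixed point J_mu = (I - alpha P_mu)^{-1} g_mu ...
J_fixed = {mu: np.linalg.solve(np.eye(2) - alpha * P[mu], g[mu]) for mu in (0, 1)}

# ... while repeated application of T converges to J*, a different vector.
J = np.zeros(2)
for _ in range(500):
    J = T(J)
print(J_fixed[0], J_fixed[1], J)   # the targets of the T_mu iterations differ from J*
```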

Page 47: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Difficulty with Parametric Fixed Point Algorithm

[Figure: the fixed point J∗ of T and the fixed points Jµ0, Jµ1, Jµ2, Jµ3 of the successive mappings Tµ, illustrating that an iteration with a fixed Tµ aims at its own fixed point rather than at J∗.]



Page 49: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Addressing the Difficulty: A High-Level View

Assume that each Tµ is a sup-norm contraction

Our approach: Embed both T and Tµ within another (uniform) contraction mapping Gµ that has the same fixed point for all µ

The uniform contraction mapping Gµ operates on a larger space

In the DP context, Gµ is associated with an optimal stopping problem

Most of what follows applies beyond DP


Page 54: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations

Page 55: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Dynamic Programming - Markovian Decision Problems (MDP)

System: Controlled Markov chain w/ transition probabilities pij (u)

States: i = 1, . . . , n

Controls: u ∈ U(i)

Cost per stage: g(i, u, j)

Stationary policy: State to control mapping µ; apply µ(i) when at state i

Discounted MDP: Find policy µ that minimizes the expected value of the infinite horizon cost

∑_{k=0}^∞ α^k g(i_k, µ(i_k), i_{k+1})

where i_k = state at time k, i_{k+1} = state at time k + 1, and

α : discount factor, 0 < α < 1
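A minimal sketch of this cost criterion (hypothetical transition and cost data): simulate the controlled chain under a stationary policy µ and average the truncated discounted cost over sample trajectories to estimate Jµ(i0).

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 3, 0.9
# hypothetical transition probabilities p_ij(u) and stage costs g(i, u, j) for 2 controls
P = rng.dirichlet(np.ones(n), size=(2, n))        # P[u, i, j] = p_ij(u)
G = rng.uniform(0.0, 1.0, size=(2, n, n))         # G[u, i, j] = g(i, u, j)
mu = np.array([0, 1, 0])                          # a stationary policy: state -> control

def discounted_cost(i0, horizon=300, samples=2000):
    total = 0.0
    for _ in range(samples):
        i, cost, disc = i0, 0.0, 1.0
        for _ in range(horizon):                  # truncate the infinite horizon
            u = mu[i]
            j = rng.choice(n, p=P[u, i])
            cost += disc * G[u, i, j]
            disc *= alpha
            i = j
        total += cost
    return total / samples                        # Monte Carlo estimate of J_mu(i0)

print(discounted_cost(0))
```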


Page 61: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Major DP Results

J∗(i) = Optimal cost starting from state i

Jµ(i) = Optimal cost starting from state i using policy µ

Bellman’s equation:

J∗(i) = min_{u∈U(i)} ∑_{j=1}^n p_ij(u) ( g(i, u, j) + αJ∗(j) ),    i = 1, . . . , n

A parametric fixed point equation.

J∗ is the unique fixed point.

An optimal policy attains the minimum in the RHS of Bellman’s equation for each i.

Bellman’s equation for a policy µ:

Jµ(i) = ∑_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJµ(j) ),    i = 1, . . . , n

It is a linear system of equations with Jµ as its unique solution.
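For concreteness, a minimal sketch (hypothetical random data) of this last point: for a fixed µ, Bellman's equation is the linear system Jµ = gµ + α Pµ Jµ, so Jµ can be obtained by a direct solve and then checked against the equation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 4, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(m, n))         # P[u, i, j] = p_ij(u)
G = rng.uniform(0.0, 1.0, size=(m, n, n))          # G[u, i, j] = g(i, u, j)
g_bar = (P * G).sum(axis=2)                        # expected stage cost, g_bar[u, i]

def evaluate(mu):
    # Solve J_mu = g_mu + alpha * P_mu J_mu, a linear system with a unique solution
    P_mu = P[mu, np.arange(n)]                     # row i taken from the matrix of control mu(i)
    g_mu = g_bar[mu, np.arange(n)]
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

mu = np.zeros(n, dtype=int)                        # an arbitrary stationary policy
J_mu = evaluate(mu)
# check that Bellman's equation for mu holds at the solution
assert np.allclose(J_mu, g_bar[mu, np.arange(n)] + alpha * P[mu, np.arange(n)] @ J_mu)
print(J_mu)
```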


Page 69: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Shorthand Notation - Fixed Point View

Denote by T and Tµ the mappings that map J ∈ ℜ^n to the vectors TJ and TµJ with components

(TJ)(i) := min_{u∈U(i)} ∑_{j=1}^n p_ij(u) ( g(i, u, j) + αJ(j) ),    i = 1, . . . , n,

and

(TµJ)(i) := ∑_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJ(j) ),    i = 1, . . . , n

Bellman’s equations are written as

J∗ = TJ∗ = min_µ TµJ∗,    Jµ = TµJµ

Key structure: T and Tµ are sup-norm contractions with modulus α,

‖TJ − TJ′‖∞ = max_{i=1,...,n} |(TJ)(i) − (TJ′)(i)| ≤ α max_{i=1,...,n} |J(i) − J′(i)| = α‖J − J′‖∞
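A sketch of these mappings in code (hypothetical data), together with a numerical check of the sup-norm contraction inequality on random pairs J, J′.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, alpha = 5, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(m, n))          # P[u, i, j] = p_ij(u)
g_bar = rng.uniform(0.0, 1.0, size=(m, n))          # expected stage cost per (u, i)

def T(J):
    # (TJ)(i) = min_u [ g_bar(u, i) + alpha * sum_j p_ij(u) J(j) ]
    return (g_bar + alpha * P @ J).min(axis=0)

def T_mu(J, mu):
    idx = np.arange(n)
    return g_bar[mu, idx] + alpha * P[mu, idx] @ J

for _ in range(5):
    J, Jp = rng.normal(size=n), rng.normal(size=n)
    lhs = np.max(np.abs(T(J) - T(Jp)))
    rhs = alpha * np.max(np.abs(J - Jp))
    assert lhs <= rhs + 1e-12                        # sup-norm contraction with modulus alpha
```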


Page 72: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Major (Synchronous) Methods for Finding Fixed Point of T

Value iteration (generic fixed point method): Start with any J0, iterate by

J t+1 = TJ t

Policy iteration (special method for T of the form T = min_µ Tµ): Start with any J0 and µ0. Given J^t and µ^t, iterate by:

Policy evaluation: J^{t+1} = (Tµ^t)^m J^t (m applications of Tµ^t on J^t; m = ∞ is possible)
Policy improvement: µ^{t+1} attains the min in TJ^{t+1} (or Tµ^{t+1} J^{t+1} = TJ^{t+1})

Both methods converge to J∗:

Value iteration, thanks to contraction of T
Policy iteration, thanks to contraction and monotonicity of T and Tµ

Typically, policy iteration (with a reasonable choice of m) is more efficient because application of Tµ is cheaper than application of T
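A minimal sketch of both loops (hypothetical data), with the optimistic variant of policy iteration that applies Tµ a fixed number m of times between policy improvements.

```python
import numpy as np

rng = np.random.default_rng(3)
n, num_u, alpha = 5, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(num_u, n))      # P[u, i, j] = p_ij(u)
g = rng.uniform(0.0, 1.0, size=(num_u, n))          # expected stage cost g(u, i)

def T(J):
    return (g + alpha * P @ J).min(axis=0)

def greedy(J):
    return (g + alpha * P @ J).argmin(axis=0)        # policy attaining the min in TJ

def T_mu(J, mu):
    idx = np.arange(n)
    return g[mu, idx] + alpha * P[mu, idx] @ J

# Value iteration: J^{t+1} = T J^t
J_vi = np.zeros(n)
for _ in range(500):
    J_vi = T(J_vi)

# Policy iteration with m policy-evaluation sweeps per improvement
J_pi, m = np.zeros(n), 10
for _ in range(100):
    mu = greedy(J_pi)            # policy improvement
    for _ in range(m):           # approximate policy evaluation: (T_mu)^m J
        J_pi = T_mu(J_pi, mu)

print(np.max(np.abs(J_vi - J_pi)))   # both converge to J*, so the difference is tiny
```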


Page 77: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Graphical Interpretation of Policy Iteration

[Figure: graphical interpretation. Value iteration follows J0, J1 = TJ0, J2 = TJ1, ... along the 45-degree line toward the fixed point J∗ = TJ∗; policy iteration alternates policy improvement (TJ = min_µ TµJ) with exact or approximate policy evaluation of Tµ, e.g. Jµ1 = Tµ1 Jµ1. Labels: Policy Improvement, Policy Evaluation, Fixed Point.]

Page 78: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Failure of Asynchronous Policy Iteration: W-B Example

Counterexample by Williams and Baird (1993)

Deterministic discounted MDP with 6 states arranged in a circle

2 controls available in half the states, 1 control available in the other half

Policy evaluations and improvements are one state at a time, no “delays"

A cycle of 15 iterations is constructed that repeats the initial conditions
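The "natural" asynchronous algorithm referred to here interleaves single-state policy-evaluation updates and single-state policy-improvement updates. The sketch below is a generic, hedged rendering of that scheme with hypothetical random data (the specific costs and transitions of the Williams-Baird instance are not reproduced in the slides); on that counterexample a particular order of such updates cycles and never converges.

```python
import numpy as np

rng = np.random.default_rng(4)
n, num_u, alpha = 6, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(num_u, n))      # P[u, i, j] = p_ij(u)
g = rng.uniform(0.0, 1.0, size=(num_u, n))          # expected stage cost g(u, i)

J = np.zeros(n)
mu = np.zeros(n, dtype=int)

def q(i, u, J):
    return g[u, i] + alpha * P[u, i] @ J

for t in range(10_000):
    i = rng.integers(n)                              # a processor picks a state
    if rng.random() < 0.5:
        J[i] = q(i, mu[i], J)                        # local policy evaluation at state i
    else:
        mu[i] = int(np.argmin([q(i, u, J) for u in range(num_u)]))   # local improvement
# On the Williams-Baird example, a suitable order of such updates repeats the
# initial (J, mu) after 15 iterations, so the method need not converge.
```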


Page 83: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations

Page 84: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Rectifying the Difficulty - Q-Factors

Q-factors Q(i, u) are functions of the state-control pairs (i, u)

The optimal Q-factors are given by

Q∗(i, u) = ∑_{j=1}^n p_ij(u) ( g(i, u, j) + αJ∗(j) )

Q∗(i, u): Cost of starting at i, using u first, then using an optimal policy.

Bellman’s equation J∗ = TJ∗ is written as

J∗(i) = min_{u∈U(i)} Q∗(i, u)

These two equations constitute a fixed point equation in (Q, J)

For a fixed policy µ, the corresponding fixed point equation in (Qµ, Jµ) is

Qµ(i, u) = ∑_{j=1}^n p_ij(u) ( g(i, u, j) + αQµ(j, µ(j)) )

Jµ(i) = Qµ(i, µ(i))
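A sketch (hypothetical data) relating these quantities: compute J∗ by value iteration, form Q∗ from J∗, and obtain Qµ for a fixed policy by solving its linear fixed point equation.

```python
import numpy as np

rng = np.random.default_rng(5)
n, num_u, alpha = 4, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(num_u, n))      # P[u, i, j] = p_ij(u)
G = rng.uniform(0.0, 1.0, size=(num_u, n, n))       # G[u, i, j] = g(i, u, j)
g = (P * G).sum(axis=2)                             # expected stage cost g(u, i)

# J* via value iteration, then Q*(i,u) = sum_j p_ij(u) (g(i,u,j) + alpha J*(j))
J = np.zeros(n)
for _ in range(1000):
    J = (g + alpha * P @ J).min(axis=0)
Q_star = g + alpha * P @ J                          # entry (u, i) holds Q*(i, u)
assert np.allclose(J, Q_star.min(axis=0))           # J*(i) = min_u Q*(i, u)

# Q_mu for a fixed policy solves Q_mu(i,u) = sum_j p_ij(u)(g(i,u,j) + alpha Q_mu(j, mu(j)))
mu = np.zeros(n, dtype=int)
A = np.zeros((num_u * n, num_u * n))                # linear system in the stacked Q_mu
b = g.reshape(-1)
for u in range(num_u):
    for i in range(n):
        row = u * n + i
        A[row, row] += 1.0
        for j in range(n):
            A[row, mu[j] * n + j] -= alpha * P[u, i, j]
Q_mu = np.linalg.solve(A, b).reshape(num_u, n)
J_mu = Q_mu[mu, np.arange(n)]                       # J_mu(i) = Q_mu(i, mu(i))
print(J_mu)
```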


Page 88: Distributed Asynchronous Computation of Fixed …dimitrib/Asynchronous_Plen_Talk.pdfDistributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming Dimitri

Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations

Sup-Norm Uniform Contraction Property

Consider Q-factors Q(i, u) and costs J(i). For any µ, define the mapping

(Q, J) ↦ ( Fµ(Q, J), Mµ(Q, J) ), where

Fµ(Q, J)(i, u) := ∑_{j=1}^{n} p_ij(u) ( g(i, u, j) + α min{ J(j), Q(j, µ(j)) } ),

Mµ(Q, J)(i) := min_{u∈U(i)} Fµ(Q, J)(i, u)

Key fact: A sup-norm contraction w/ common fixed point (Q∗, J∗) for all µ

max{ ‖Fµ(Q, J) − Q∗‖∞, ‖Mµ(Q, J) − J∗‖∞ } ≤ α max{ ‖Q − Q∗‖∞, ‖J − J∗‖∞ }

The asynchronous iteration for the fixed point problem

Q := Fµ(Q, J), J := Mµ(Q, J), µ := arbitrary

is convergent.

Even though we operate with different mappings Fµ and Mµ, they all have a common fixed point.
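The sketch below numerically checks this uniform contraction property; it is illustrative only, with random hypothetical MDP data. It first finds the common fixed point (Q∗, J∗) by repeating the convergent iteration quoted above (with an arbitrary µ each sweep), then tests the inequality at a random (Q, J) and a random µ.

import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 4, 3, 0.9
P = rng.random((n, m, n)); P /= P.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((n, m, n))                                      # g(i, u, j)

def F_mu(Q, J, mu):
    # F_mu(Q, J)(i, u) = sum_j p_ij(u) * (g(i, u, j) + alpha * min{J(j), Q(j, mu(j))})
    w = np.minimum(J, Q[np.arange(n), mu])
    return np.einsum('iuj,iuj->iu', P, g + alpha * w[None, None, :])

def M_mu(Q, J, mu):
    # M_mu(Q, J)(i) = min_u F_mu(Q, J)(i, u)
    return F_mu(Q, J, mu).min(axis=1)

# Q := F_mu(Q, J), J := M_mu(Q, J), mu := arbitrary -- run to (near) convergence.
Q, J = np.zeros((n, m)), np.zeros(n)
for _ in range(1000):
    mu = rng.integers(0, m, size=n)
    Q, J = F_mu(Q, J, mu), M_mu(Q, J, mu)
Qs, Js = Q, J

# Test the contraction inequality at a random point and a random policy.
Q, J = rng.random((n, m)), rng.random(n)
mu = rng.integers(0, m, size=n)
lhs = max(np.abs(F_mu(Q, J, mu) - Qs).max(), np.abs(M_mu(Q, J, mu) - Js).max())
rhs = alpha * max(np.abs(Q - Qs).max(), np.abs(J - Js).max())
print(lhs <= rhs + 1e-9)                                       # the key inequality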


Asynchronous Policy Iteration: A More Detailed View

Asynchronous distributed policy iteration algorithm: Maintains J^t, µ^t, and V^t, where

V^t(i) = Q^t(i, µ^t(i)) ≈ Q-factors of the current policy

Processor i at time t does one of three things:

Does nothing

Or operates with F_{µ^t} at i:

V^{t+1}(i) = ∑_{j=1}^{n} p_ij(µ^t(i)) ( g(i, µ^t(i), j) + α min{ J^t(j), V^t(j) } )

and leaves J^t(i), µ^t(i) unchanged.

Or operates with M_{µ^t} at i: Sets

J^{t+1}(i) = V^{t+1}(i) = min_{u∈U(i)} ∑_{j=1}^{n} p_ij(u) ( g(i, u, j) + α min{ J^t(j), V^t(j) } )

and sets µ^{t+1}(i) to a control that attains the minimum.

Convergence follows by the asynchronous convergence theorem.
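A small single-machine simulation of this scheme follows; it is a sketch under simplifying assumptions (random hypothetical MDP data, one randomly chosen state updated per step, a coin flip between the two update types, and no communication delays), and the result is compared against J∗ from ordinary value iteration.

import numpy as np

rng = np.random.default_rng(2)
n, m, alpha = 5, 3, 0.9
P = rng.random((n, m, n)); P /= P.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((n, m, n))                                      # g(i, u, j)

J = np.zeros(n)              # J^t
V = np.zeros(n)              # V^t (≈ Q-factors of the current policy along mu)
mu = np.zeros(n, dtype=int)  # mu^t

def backup(i, u):
    # sum_j p_ij(u) * (g(i, u, j) + alpha * min{J(j), V(j)})
    return P[i, u] @ (g[i, u] + alpha * np.minimum(J, V))

for t in range(20000):
    i = rng.integers(n)                        # the "processor" that wakes up
    if rng.random() < 0.5:
        V[i] = backup(i, mu[i])                # operate with F_mu at i
    else:
        vals = np.array([backup(i, u) for u in range(m)])
        mu[i] = vals.argmin()                  # operate with M_mu at i
        J[i] = V[i] = vals[mu[i]]

# Reference J* from synchronous value iteration.
Jstar = np.zeros(n)
for _ in range(2000):
    Jstar = np.einsum('iuj,iuj->iu', P, g + alpha * Jstar[None, None, :]).min(axis=1)
print(np.max(np.abs(J - Jstar)))               # should be (near) zero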


Williams and Baird Counterexample


Outline

1 Asynchronous Computation of Fixed Points

2 Value and Policy Iteration for Discounted MDP

3 New Asynchronous Policy Iteration

4 Generalizations


Generalized Mappings T and Tµ

The preceding analysis uses only the contraction property of the discounted MDP (not monotonicity or the probabilistic structure)

Abstract mappings T and Tµ: Introduce a mapping H(i, u, J) and denote

(TJ)(i) = min_{u∈U(i)} H(i, u, J),    (TµJ)(i) = H(i, µ(i), J)

i.e., TJ = minµ TµJ, where the min is taken separately for each component

Assume that for all i and u ∈ U(i):

|H(i, u, J) − H(i, u, J′)| ≤ α ‖J − J′‖∞

Asynchronous algorithm: At time t, processor i does one of three things:

Does nothing

Or does a “policy evaluation” at i: Sets

V^{t+1}(i) = H(i, µ^t(i), min{J^t, V^t})

and leaves J^t(i), µ^t(i) unchanged.

Or does a “policy improvement” at i: Sets

J^{t+1}(i) = V^{t+1}(i) = min_{u∈U(i)} H(i, u, min{J^t, V^t})

and sets µ^{t+1}(i) to a u that attains the minimum.
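Since the updates touch the model only through H, the same loop can be coded against an abstract H. The example H below is made up purely for illustration (a discounted weighted-max backup, chosen so that only the sup-norm contraction assumption holds, with no probabilistic or monotone structure used); per the talk, the iterates still converge to the fixed point of T.

import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 4, 2, 0.8
c = rng.random((n, m))
w = rng.random((n, m, n))                      # weights in [0, 1)

def H(i, u, J):
    # |H(i,u,J) - H(i,u,J')| <= alpha * ||J - J'||_inf, since each term
    # w*J(j) + (1-w)*min(J) is 1-Lipschitz in the sup-norm and so is their max.
    return c[i, u] + alpha * np.max(w[i, u] * J + (1 - w[i, u]) * J.min())

J, V = np.zeros(n), np.zeros(n)
mu = np.zeros(n, dtype=int)
for t in range(20000):
    i = rng.integers(n)
    W = np.minimum(J, V)                       # min{J^t, V^t}, componentwise
    if rng.random() < 0.5:
        V[i] = H(i, mu[i], W)                  # "policy evaluation" at i
    else:
        vals = np.array([H(i, u, W) for u in range(m)])
        mu[i] = vals.argmin()                  # "policy improvement" at i
        J[i] = V[i] = vals[mu[i]]

# J should now approximately satisfy (TJ)(i) = min_u H(i, u, J) = J(i).
TJ = np.array([min(H(i, u, J) for u in range(m)) for i in range(n)])
print(np.max(np.abs(J - TJ)))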


DP Applications with Generalized T and Tµ

DP models beyond discounted with standard policy evaluation

Optimistic/modified policy iteration for semi-Markov and minimax discounted problems

Stochastic shortest path problems

Semi-Markov decision problems

Stochastic zero sum games

Sequential minimax problems

Multi-agent aggregation

Each agent updates costs at all states within a subset

Each agent uses detailed costs for the local states and aggregate costs for other states, as communicated by other agents


Fixed Points of Concave Sup-Norm Contractions

[Figure: the mapping TJ = minµ TµJ, with each TµJ linear; iterates J0, J1 = TµJ0, Jµ, and the fixed point J∗ = TJ∗, with µ-updates changing the linearization]

Find a fixed point of a mapping T : ℝⁿ → ℝⁿ, where the components Ti(·) : ℝⁿ → ℝ are concave sup-norm contractions

T is the minimum of a collection of linear mappings Tµ involving the concave conjugate functions of the concave functions Ti

An asynchronous parametric fixed point algorithm can be used:

Linearize at the current J

Iterate using the linearized mapping

Update the linearization

All of the above are done asynchronously
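As a toy instance of this setting (entirely hypothetical data, not from the talk): take T to be the componentwise minimum of a few affine mappings TuJ = AuJ + bu with ‖Au‖∞ = α < 1, so that each component Ti is concave (a minimum of affine functions) and an α-contraction. The sketch below builds such a T, finds its fixed point, and exhibits the linearization at the current J; the asynchronous scheme of the previous slides then applies with H(i, u, J) = (AuJ + bu)(i).

import numpy as np

rng = np.random.default_rng(4)
n, K, alpha = 4, 3, 0.9

# Hypothetical linear (affine) maps T_u J = A_u J + b_u with ||A_u||_inf = alpha.
A = rng.random((K, n, n))
A *= alpha / A.sum(axis=2, keepdims=True)      # scale each (nonnegative) row to sum to alpha
b = rng.random((K, n))

def H(i, u, J):
    return A[u, i] @ J + b[u, i]               # i-th component of T_u J

def T(J):
    # T_i(J) = min_u (A_u J + b_u)(i): concave and an alpha-contraction per component
    return np.array([min(H(i, u, J) for u in range(K)) for i in range(n)])

# T is a sup-norm contraction, so plain fixed point iteration recovers J*.
J = np.zeros(n)
for _ in range(500):
    J = T(J)

# "Linearize at the current J": per component, pick the linear map attaining the minimum.
mu = [min(range(K), key=lambda u: H(i, u, J)) for i in range(n)]
print(np.allclose([H(i, mu[i], J) for i in range(n)], T(J)))   # T_mu J = T J at J

The asynchronous policy-iteration sketch given earlier can then be reused with this H in place of the MDP backup; the V and min{J, V} mechanics are unchanged.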


Concluding Remarks

Asynchronous policy iteration has fragile convergence properties

We have provided a new approach to correct the difficulties

Key idea: Embed the problem into one that involves a uniform sup-norm contraction

Extensions to generalized DP models involving:

Other DP models (e.g., stochastic shortest path, games, etc.)

Fixed point problems for non-DP mappings of the form T = minµ∈M Tµ, such as concave sup-norm contractions


Thank You!