Distributed Asynchronous Computation of Fixed Points and Applications in Dynamic Programming
Dimitri P. Bertsekas
Laboratory for Information and Decision SystemsMassachusetts Institute of Technology
PCO ’13Cambridge, MA
May 2013
Joint Work with Huizhen (Janey) Yu
Summary
Background Review
Asynchronous iterative fixed point methods
Convergence issues
Dynamic programming (DP) applications
New Research on Asynchronous DP Algorithms
Asynchronous policy iteration (asynchronous value and policy updates by multiple processors)
Failure of the “natural” algorithm (Williams-Baird counterexample, 1993)
A radical modification of policy iteration/evaluation: aim to solve an optimal stopping problem instead of solving a linear system
Convergence properties are restored/enhanced
Generalizations and abstractions
References
Current work
D. P. Bertsekas and H. Yu, “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming,” Math. of OR, 2012
H. Yu and D. P. Bertsekas, “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems,” to appear in Annals of OR
D. P. Bertsekas and H. Yu, “Distributed Asynchronous Policy Iteration,” Proc. Allerton Conference, Sept. 2010
More work in progress
Connection to early work on the theory of totally asynchronous algorithms
D. P. Bertsekas, “Distributed Dynamic Programming,” IEEE Transactions on Aut. Control, Vol. AC-27, 1982
D. P. Bertsekas, “Distributed Asynchronous Computation of Fixed Points,” Mathematical Programming, Vol. 27, 1983
D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989
Outline
1 Asynchronous Computation of Fixed Points
2 Value and Policy Iteration for Discounted MDP
3 New Asynchronous Policy Iteration
4 Generalizations
Some History: The First “Real” Implementation of a Distributed Asynchronous Iterative Algorithm
Routing algorithm of the ARPANET (1969)
Based on the Bellman-Ford shortest path algorithm
J(i) = min_{j=0,1,...,n} { a_ij + J(j) },   i = 1, ..., n,
where:
J(i): estimated distance of node i to destination node 0 (J(0) = 0)
a_ij: length of the link connecting i to j
Original ambitious implementation included congestion-dependent a_ij
Failed miserably because of violent oscillations
The algorithm was “stabilized” by making a_ij essentially constant (1974)
Showing convergence of this algorithm and its DP extensions was my entry point into the field (1979)
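The Bellman-Ford iteration above is easy to simulate. The following minimal Python sketch (the four-node graph and the randomized, one-node-at-a-time update order are illustrative assumptions, not the ARPANET algorithm itself) shows the asynchronous iterates settling at the shortest distances:

```python
import random

# Illustrative 4-node graph: a[i][j] = length of the link from node i to node j.
# Node 0 is the destination, so J(0) = 0 by definition.
a = {
    1: {0: 4.0, 2: 1.0},
    2: {0: 2.0, 3: 1.0},
    3: {0: 5.0, 1: 1.0},
}
INF = float("inf")
J = {0: 0.0, 1: INF, 2: INF, 3: INF}

random.seed(0)
for _ in range(100):
    # Asynchronous update: one node at a time, in arbitrary (here random) order.
    i = random.choice([1, 2, 3])
    J[i] = min(a[i][j] + J[j] for j in a[i])   # J(i) = min_j { a_ij + J(j) }

print(J)   # converges to the shortest distances: {0: 0.0, 1: 3.0, 2: 2.0, 3: 4.0}
```

Note that no update order is imposed: starting from J = +∞ (except at the destination), the iterates decrease monotonically, which is exactly the monotonicity that makes the totally asynchronous version converge.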
Early Work on Asynchronous Iterative Algorithms
Initial proposal by Rosenfeld (1967) involved experimentation but no convergence analysis
Sup-norm linear contractive iterations by Chazan and Miranker (1969)
Sup-norm nonlinear contractive iterations by Miellou (1975), Robert (1976), Baudet (1978)
DP algorithms (Bertsekas 1982, 1983): both sup-norm contractive and noncontractive/monotone iterations (e.g., shortest path)
Many subsequent works (nonexpansive, network, and other iterations)
Types of Asynchronism
All asynchronous iterative algorithms proposed up to the early 80s involved “total asynchronism” and either sup-norm contractive or monotone mappings
Subsequent work also involved “partial asynchronism,” which could be applied to additional types of iterations (e.g., gradient methods for optimization)
Distributed “Totally” Asynchronous Framework for Fixed Point Computation
Consider solution of the general fixed point problem J = TJ, or
J(i) = T_i( J(1), ..., J(n) ),   i = 1, ..., n
Network of processors, with a separate processor i for each component J(i), i = 1, ..., n
Asynchronous Fixed Point Algorithm
Processor i updates J(i) at a subset of times T_i ⊂ {0, 1, ...}
Processor i receives (possibly outdated) values J(j) from the other processors j ≠ i
Update of processor i [with “delays” t − τ_ij(t)]:
J^{t+1}(i) = T_i( J^{τ_i1(t)}(1), ..., J^{τ_in(t)}(n) )   if t ∈ T_i,
J^{t+1}(i) = J^t(i)   if t ∉ T_i.
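This update rule can be sketched in simulation. In the following Python sketch, the component maps T_i, the 0.7 update probability, and the bounded-delay model (delays of at most 3 steps) are illustrative assumptions, chosen so that T is a sup-norm contraction:

```python
import random

# Illustrative component maps: T_i(J) = b_i + alpha * (J(1)+...+J(n))/n,
# a sup-norm contraction with modulus alpha < 1; fixed point J* = (3, 4, 5).
n, alpha = 3, 0.5
b = [1.0, 2.0, 3.0]

def T_i(i, Jvals):
    return b[i] + alpha * sum(Jvals) / n

random.seed(1)
history = [[0.0] * n]                 # history[t][i] = J^t(i), starting from J^0 = 0
for t in range(300):
    J_next = list(history[-1])
    for i in range(n):
        if random.random() < 0.7:     # t ∈ T_i: processor i updates at time t
            # Delayed reads: processor i sees component j as of time
            # τ_ij(t) ∈ [t-3, t], i.e., delays t - τ_ij(t) of at most 3 steps.
            tau = [random.randint(max(0, t - 3), t) for _ in range(n)]
            J_next[i] = T_i(i, [history[tau[j]][j] for j in range(n)])
        # else t ∉ T_i: J^{t+1}(i) = J^t(i), so J_next[i] keeps its old value
    history.append(J_next)

print(history[-1])                    # approaches the fixed point [3.0, 4.0, 5.0]
```

Despite the skipped updates and stale reads, the sup-norm distance to J* still contracts over "epochs" in which every processor performs at least one update with sufficiently fresh information, which is the mechanism behind the convergence theorem that follows.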
Distributed Convergence of Fixed Point Iterations
A general theorem for “totally asynchronous” iterations, i.e., the T_i are infinite sets and τ_ij(t) → ∞ as t → ∞ (Bertsekas, 1983)
[Figure: nested sets S(0) ⊃ · · · ⊃ S(k) ⊃ S(k+1) ⊃ · · · shrinking toward the fixed point J*, shown for a two-component example J = (J1, J2), TJ = (T1J, T2J); overlaid value/policy iteration diagrams omitted]
Assume there is a nested sequence of sets S(k+1) ⊂ S(k) such that:
(Synchronous Convergence Condition) We have
TJ ∈ S(k+1),   for all J ∈ S(k),
and the limit points of all sequences {J^k} with J^k ∈ S(k), for all k, are fixed points of T.
(Box Condition) S(k) is a Cartesian product:
S(k) = S_1(k) × · · · × S_n(k)
Then, if J^0 ∈ S(0), every limit point of {J^t} is a fixed point of T.
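For a sup-norm contraction with modulus α and fixed point J*, both conditions hold with the nested boxes S(k) = Π_i [J*(i) − α^k c, J*(i) + α^k c]. Here is a numerical spot-check of this under an illustrative contraction (the specific map and constants are assumptions for the sketch, not from the talk):

```python
import random

# Illustrative sup-norm contraction with modulus alpha and fixed point J*.
alpha, c = 0.5, 10.0
b = [1.0, 2.0, 3.0]
n = len(b)
J_star = [3.0, 4.0, 5.0]              # satisfies J*(i) = b_i + alpha * mean(J*)

def T(J):
    m = sum(J) / n
    return [b[i] + alpha * m for i in range(n)]

def in_S(J, k):
    # S(k) = Π_i [J*(i) - alpha^k c, J*(i) + alpha^k c]: a Cartesian product
    # of intervals, so the Box Condition holds by construction.
    return all(abs(J[i] - J_star[i]) <= alpha**k * c for i in range(n))

# Synchronous Convergence Condition: sample J ∈ S(k) and check TJ ∈ S(k+1).
random.seed(2)
ok = all(
    in_S(T([J_star[i] + random.uniform(-(alpha**k * c), alpha**k * c)
            for i in range(n)]), k + 1)
    for k in range(8) for _ in range(100)
)
print(ok)   # True: the nested boxes satisfy both conditions of the theorem
```

The box structure is what makes asynchrony harmless here: each component interval shrinks independently, so a processor combining components read at different times still produces a point inside the next box.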
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
Distributed Convergence of Fixed Point Iterations

A general theorem for “totally asynchronous” iterations, i.e., the update-time sets Ti are infinite and τij(t) → ∞ as t → ∞ (Bertsekas, 1983):

Assume there is a nested sequence of sets S(k + 1) ⊂ S(k) such that:

(Synchronous Convergence Condition) We have

TJ ∈ S(k + 1), ∀ J ∈ S(k),

and the limit points of all sequences {Jk} with Jk ∈ S(k) for all k are fixed points of T.

(Box Condition) S(k) is a Cartesian product:

S(k) = S1(k) × · · · × Sn(k)

Then, if J0 ∈ S(0), every limit point of {J^t} is a fixed point of T.
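As a concrete illustration of the totally asynchronous model, the sketch below runs value iteration for a small made-up 2-state discounted problem: each component of J is updated at random times using possibly stale values of the other components, yet the iterates still approach J∗. The problem data (g, p, α), the delay model, and all names are invented for the example, not taken from the talk.

```python
import random

# Hypothetical 2-state, 2-control discounted problem (data invented for
# illustration): T(J)_i = min_u [ g(i,u) + alpha * sum_j p_ij(u) J_j ].
alpha = 0.9
g = {0: [1.0, 2.0], 1: [0.5, 3.0]}              # g[i][u]
p = {0: [[0.5, 0.5], [0.9, 0.1]],               # p[i][u][j]
     1: [[0.2, 0.8], [0.7, 0.3]]}

def T_component(i, J):
    """One component of the Bellman operator TJ = min_u (g + alpha * P J)."""
    return min(g[i][u] + alpha * sum(p[i][u][j] * J[j] for j in range(2))
               for u in range(2))

def async_vi(num_updates=5000, max_delay=10, seed=0):
    """Totally asynchronous VI: a random component is updated at each step,
    reading a possibly outdated past iterate (bounded staleness here)."""
    rng = random.Random(seed)
    history = [[0.0, 0.0]]                      # all past iterates J^t
    for _ in range(num_updates):
        J = list(history[-1])
        i = rng.randrange(2)                    # processor i updates...
        t_stale = max(0, len(history) - 1 - rng.randrange(max_delay))
        J[i] = T_component(i, history[t_stale]) # ...with stale information
        history.append(J)
    return history[-1]

# Synchronous reference solution J* = TJ* by plain value iteration.
J_star = [0.0, 0.0]
for _ in range(2000):
    J_star = [T_component(i, J_star) for i in range(2)]

J_async = async_vi()
print(max(abs(J_async[i] - J_star[i]) for i in range(2)))  # sup-norm error, small
```

Since T is a sup-norm contraction with modulus α, the nested sets S(k) = {J | ‖J − J∗‖∞ ≤ α^k B} satisfy both the synchronous convergence and box conditions, which is why the stale updates do no lasting harm.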
Applications of the Theorem
Major contexts where the theorem applies:

T is a sup-norm contraction with fixed point J∗ and modulus α:

S(k) = {J | ‖J − J∗‖∞ ≤ α^k B}, for some scalar B

T is monotone (TJ ≤ TJ′ for J ≤ J′) with fixed point J∗. Moreover,

S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄}

for some J̲ and J̄ with

J̲ ≤ T J̲ ≤ T J̄ ≤ J̄,

and lim_{k→∞} T^k J̲ = lim_{k→∞} T^k J̄ = J∗.

Both of these apply to various DP problems:
The 1st context applies to discounted problems
The 2nd context applies to undiscounted problems (e.g., shortest paths)
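The monotone construction can be checked numerically on a tiny example. The sketch below uses an illustrative 2-state affine monotone operator (all data invented for the example), starts from J̲ = 0 and a large J̄, and verifies that T^k J̲ increases while T^k J̄ decreases, bracketing the common fixed point with a gap shrinking like α^k.

```python
# Monotone-context boxes S(k) = {J : T^k J_lo <= J <= T^k J_hi} for a small
# discounted operator with one control per state (data invented): TJ = g + alpha*P*J.
alpha = 0.9
g = [1.0, 0.5]
P = [[0.5, 0.5], [0.2, 0.8]]

def T(J):
    return [g[i] + alpha * sum(P[i][j] * J[j] for j in range(2)) for i in range(2)]

J_lo = [0.0, 0.0]        # J_lo <= T(J_lo) since g >= 0
J_hi = [100.0, 100.0]    # J_hi >= T(J_hi) for J_hi large enough
for k in range(200):
    new_lo, new_hi = T(J_lo), T(J_hi)
    # Monotone improvement: T^k J_lo increases, T^k J_hi decreases
    # (small tolerance guards against floating-point rounding at convergence).
    assert all(new_lo[i] >= J_lo[i] - 1e-12 and new_hi[i] <= J_hi[i] + 1e-12
               for i in range(2))
    J_lo, J_hi = new_lo, new_hi

# Both endpoints converge to the same fixed point J*.
print(max(J_hi[i] - J_lo[i] for i in range(2)))  # gap ~ 100 * alpha^k
```

Because each box S(k) is an interval, hence a Cartesian product, the Box Condition holds automatically, and monotonicity of T gives the synchronous convergence condition.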
Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations
Applications of the Theorem
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗ J = (J1, J2) TJ = (T1J, T2J)
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗ J = (J1, J2) TJ = (T1J, T2J)
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
S1(0) S2(0)
N(Φ) = N(C) Ra(Φ�) = N(Φ)⊥ R∗ = {r | Φr = x∗} r0
x − x∗ = Φ(r − r∗) x∗ x∗ − T (x∗) T (x∗) ΠT (x∗) = x∗ = Φr∗ {rk}
r0 + D−1Ra(C)
Subspace S = {Φr | r ∈ �s} Set S 0
Limit of {rk}x = Φr
1
Major contexts where the theorem applies:
T is a sup-norm contraction with fixed point J∗ and modulus α:
S(k) = {J | ‖J − J∗‖∞ ≤ αk B}, for some scalar B
T is monotone (TJ ≤ TJ ′ for J ≤ J ′) with fixed point J∗. Moreover,
S(k) = {J | T k J ≤ J ≤ T k J}for some J and J with
J ≤ T J ≤ T J ≤ J,and limk→∞ T k J = limk→∞ T k J = J∗.
Both of these apply to various DP problems:1st context applies to discounted problems2nd context applies to undiscounted problems (e.g., shortest paths)
Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations
Applications of the Theorem
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
k Stages j1 j2 jk
rµ1 rµ2 rµ3 rµk+3
Rµ1 Rµ2 Rµ3 Rµk+3
Controllable State Components Post-Decision States
State-Control Pairs: Fixed Policy µ Case (j, v) P ∼ |A|
j�0 j�1 j�k j�k+1 j�0 u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
j0 j1 jk jk+1 i0 i1 ik ik+1
u p(z | j) g(i, u,m) m m = f(i, u) q(j | m)
(i, y) (j, z) States j g(i, y, u, j) pij(u) g(i, u, j) v µ(j)�j, µ(j)
�
f�2,Xk
(−λ)
x1 x2 x3 x4 Slope: xk+1 Slope: xi, i ≤ k
1
J0 J1 = TJ0 J2 = TJ1 J J∗ = TJ∗ TJ Tµ1J Jµ1 = Tµ1Jµ1
S(0) S(k) S(k + 1) J∗
Policy Iteration J t+1 = TJ t = Tµt+1J t J t+1 = Tmµt J t
Value Iteration 45 Degree Line J t+1 = TJ t J2 = T 2µ1J1 J0 �→
(J1, µ1) TJ = minµ TµJ
Policy Improvement Exact Policy Evaluation (Exact if m = ∞)
Approx. Policy Evaluation J t �→ (J t+1, µt+1)
Policy µ Policy µ∗ (a) (b) rµ = 0 Rµ Rµ∗
rµ∗ ≈ c
1 − α,
c
αrµ = 0
1 2
[Figure residue from earlier slides: MDP transition diagrams for a fixed policy µ; value iteration J t+1 = TJ t and policy iteration (policy improvement with exact/approximate policy evaluation) converging to J∗ = TJ∗ through the nested sets S(0), S(k), S(k + 1); an example with rµ = 0 and rµ∗ ≈ c/(1 − α); projected-equation figure with subspace S = {Φr | r ∈ ℜ^s} and fixed point ΠT (x∗) = x∗ = Φr∗.]
Major contexts where the theorem applies:
T is a sup-norm contraction with fixed point J∗ and modulus α:
S(k) = {J | ‖J − J∗‖∞ ≤ α^k B}, for some scalar B
T is monotone (TJ ≤ TJ′ for J ≤ J′) with fixed point J∗. Moreover,
S(k) = {J | T^k J̲ ≤ J ≤ T^k J̄} for some J̲ and J̄ with
J̲ ≤ T J̲ ≤ T J̄ ≤ J̄, and lim k→∞ T^k J̲ = lim k→∞ T^k J̄ = J∗.
Both of these apply to various DP problems: the 1st context applies to discounted problems; the 2nd context applies to undiscounted problems (e.g., shortest paths)
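The first (sup-norm contraction) context can be checked numerically. Below is a minimal sketch for a hypothetical 2-state, 2-action discounted MDP — the transition probabilities and costs are made up for illustration: the Bellman operator T contracts with modulus α, so the iterates stay inside the shrinking boxes S(k).

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP (all numbers illustrative).
alpha = 0.9                                 # discount factor = contraction modulus
P = np.array([[[0.8, 0.2], [0.3, 0.7]],     # P[i][u][j]: transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0], [0.5, 3.0]])      # g[i][u]: one-stage costs

def T(J):
    """Bellman operator: (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]."""
    return (g + alpha * P @ J).min(axis=1)

# Compute J* to high accuracy by value iteration.
J_star = np.zeros(2)
for _ in range(2000):
    J_star = T(J_star)

# Verify the nested-box property S(k) = {J : ||J - J*||_inf <= alpha^k B},
# with B = ||J0 - J*||_inf: applying T moves S(k) into S(k+1).
J = np.array([10.0, -10.0])
B = np.max(np.abs(J - J_star))
for k in range(50):
    assert np.max(np.abs(J - J_star)) <= alpha**k * B + 1e-9
    J = T(J)
```

The inner assertion is exactly the membership test J ∈ S(k) from the slide above, for this particular contraction.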
Asynchronous Computation of Fixed Points Value and Policy Iteration for Discounted MDP New Asynchronous Policy Iteration Generalizations
Applications of the Theorem
[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k + 1) converging to J∗, with value iteration and policy iteration iterates and MDP transition-diagram labels; projected-equation subspace figure with S = {Φr | r ∈ ℜ^s}.]
Asynchronous Convergence Mechanism
[Figure: two-component example J = (J1, J2), TJ = (T1J, T2J); J1 iterations and J2 iterations move within the nested boxes S(0) ⊃ S(k) ⊃ S(k + 1) toward J∗.]
Once we are in S(k), iteration of any one component (say J1) keeps us in the current box
Once all components have been iterated once, we pass into the next box S(k + 1) (permanently)
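The mechanism above can be simulated. A minimal sketch, assuming a made-up linear sup-norm contraction TJ = AJ + b with ‖A‖∞ = α: components are updated one at a time in an arbitrary (here random) order, each infinitely often, and the iterates still converge to the unique fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9

# Made-up linear sup-norm contraction T(J) = A J + b, with ||A||_inf = alpha.
A = rng.random((4, 4))
A = alpha * A / A.sum(axis=1, keepdims=True)   # every row of A sums to alpha
b = rng.random(4)
J_star = np.linalg.solve(np.eye(4) - A, b)     # the unique fixed point of T

# Asynchronous iteration: at each step one "processor" i updates only its
# own component, J(i) := (TJ)(i); the schedule is arbitrary (here random).
J = np.zeros(4)
for _ in range(2000):
    i = rng.integers(4)
    J[i] = A[i] @ J + b[i]
```

Each full "round" in which all four components get updated at least once pushes the iterate from box S(k) into S(k + 1), regardless of the order within the round.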
Finding Fixed Points of Parametric Mappings
Problem: Find a fixed point J∗ of a mapping T : ℜ^n → ℜ^n of the form
(TJ)(i) = minµ∈Mi (TµJ)(i), i = 1, . . . , n,
where µ is a parameter from some set Mi.
Common idea (central in DP policy iteration): Instead of T, we iterate with some Tµ, then change µ once in a while
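A minimal sketch of this idea (iterate m times with the current Tµ, then reset µ to a minimizer — an optimistic/modified policy iteration), again on a hypothetical 2-state, 2-action discounted MDP with made-up numbers:

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[i][u][j]: made-up transition probs
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0], [0.5, 3.0]])     # g[i][u]: made-up one-stage costs
idx = np.arange(2)

def T_mu(J, mu):
    """(Tµ J)(i) = g(i, µ(i)) + alpha * sum_j p_ij(µ(i)) J(j)."""
    return g[idx, mu] + alpha * (P[idx, mu] @ J)

def greedy(J):
    """The µ(i) attaining the min in (TJ)(i) = min_µ (Tµ J)(i)."""
    return (g + alpha * P @ J).argmin(axis=1)

# Iterate with Tµ, changing µ only "once in a while" (every m value updates).
J, mu, m = np.zeros(2), greedy(np.zeros(2)), 5
for _ in range(100):
    for _ in range(m):
        J = T_mu(J, mu)    # m value-iteration steps with the policy fixed
    mu = greedy(J)         # parameter (policy) update
```

With m = 1 this is value iteration J t+1 = TJ t; with m = ∞ it is exact policy iteration, matching the spectrum sketched in the earlier figures.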
[Figure: iterating with Tµ, with occasional µ-updates: J0, J1 = TµJ0, and the sequence Jµ0, Jµ1, Jµ2, Jµ3 converging to J∗ = Jµ∗ = TJ∗; regions Rµ = {r | Tµ(Φr) = T (Φr)} over the subspace S = {Φr | r ∈ ℜ^s}.]
Distributed Asynchronous Parametric Fixed Point Iterations
Partition J and µ into components, J(1), . . . , J(n) and µ(1), . . . , µ(n)
Asynchronous Algorithm
Processor i, at each time, does one of three things:
Does nothing, or
Updates J(i) by setting it to
J(i) := (Tµ(i)J)(i)
or
Updates J(i) and µ(i) by setting
J(i) := (TJ)(i) = minµ∈Mi (TµJ)(i)
and
µ(i) := arg minµ∈Mi (TµJ)(i)
A very chaotic environment!
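A minimal single-machine simulation of these three choices, on a hypothetical 2-state, 2-action discounted MDP with made-up numbers (here Mi is the action set of state i). The random schedule mimics the chaotic environment; the iterates provably stay bounded, but — as the difficulty discussed next shows — convergence of such a scheme is not guaranteed in general.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[i][u][j]: made-up transition probs
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0], [0.5, 3.0]])    # g[i][u]: made-up one-stage costs

J = np.zeros(2)
mu = np.zeros(2, dtype=int)
for _ in range(5000):
    i = rng.integers(2)                   # "processor" i acts at this time
    choice = rng.integers(3)              # one of the three possible actions
    if choice == 0:
        pass                              # does nothing
    elif choice == 1:
        # value update using the locally stored policy component mu(i)
        J[i] = g[i, mu[i]] + alpha * (P[i, mu[i]] @ J)
    else:
        # joint update: J(i) := (TJ)(i), mu(i) := a minimizing parameter
        q = g[i] + alpha * (P[i] @ J)     # q[u] = (T_u J)(i)
        J[i] = q.min()
        mu[i] = q.argmin()
```

Since every update sets J(i) to a cost plus α times a convex combination of the current components, ‖J‖∞ can never exceed max|g|/(1 − α), whatever the schedule.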
Difficulty with Parametric Fixed Point Algorithm
[Figure: iterates labeled Jµ0, Jµ1, Jµ2, Jµ3, Jµ0 — the sequence can return to Jµ0 without reaching J∗; nearby policy values Jµ, Jµ′, Jµ′′ also shown.]
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
J∗ Jµ Jµ� Jµ�� Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
: Fixed Point of T
J∗ Jµ Jµ� Jµ��Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
LP CONVEX NLP
Simplex
Policy Evaluation Improvement Exploration Enhancement
νk Jk Qk+1 Jk+1 µk+1 νk+1 Qk+1
Subgradient Cutting plane Interior point Subgradient
Polyhedral approximation
LPs are solved by simplex method
NLPs are solved by gradient/Newton methods.
Convex programs are special cases of NLPs.
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondifferentiability and piecewise linearity are common features.
Primal Problem Description
Dual Problem Description
Vertical Distances
Crossing Point Differentials
Values f(x) Crossing points f∗(y)
−f∗1 (y) f∗
1 (y) + f∗2 (−y) f∗
2 (−y)
Slope y∗ Slope y
A union of points An intersection of halfspaces
minx
�f1(x) + f2(x)
�= maxy
�f∗1 (y) + f∗
2 (−y)�
1
: Fixed Point of
A More Abstract View of What Follows

We want to find a fixed point J∗ of a mapping T : ℝn → ℝn of the form

(TJ)(i) = min_{µ∈Mi} (TµJ)(i),   i = 1, . . . , n,

where µ is a parameter from some set Mi.

We update J in two ways:

Iterate with T: J → TJ (cf. value iteration/DP), OR

Pick a µ and iterate with Tµ: J → TµJ (cf. policy evaluation/DP)

Difficulty: Tµ has a different fixed point than T ... so iterations with Tµ aim at a target other than J∗

Our key idea (abstractly): Embed both T and Tµ within another (uniform) contraction mapping Fµ that has the same fixed point for all µ

The uniform contraction mapping Fµ operates on the larger space of Q-factors

In the DP context, Fµ is associated with an optimal stopping problem

Most of what follows applies beyond DP

Tµ has a different fixed point than T ... so the target of the iterations keeps changing

A (restrictive) convergence result: If each Tµ is both a sup-norm contraction and monotone, convergence can be shown assuming that Tµ0 J0 ≤ J0

Without the condition Tµ0 J0 ≤ J0, the synchronous algorithm converges but the asynchronous algorithm may oscillate
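The difficulty can be seen numerically on a tiny invented MDP (a sketch; the numbers below are made up): the fixed point Jµ of a fixed policy mapping Tµ, obtained by solving a linear system, generally differs from the fixed point J∗ of T, obtained here by value iteration.

```python
import numpy as np

# Tiny 2-state, 2-control discounted MDP (numbers invented for illustration).
alpha = 0.9
P = np.array([[[0.5, 0.5], [0.9, 0.1]],
              [[0.2, 0.8], [0.4, 0.6]]])   # P[i, u, j] = p_ij(u)
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])                  # expected cost per stage g(i, u)

def T_mu(J, mu):
    """Apply the policy mapping T_mu componentwise."""
    return np.array([g[i, mu[i]] + alpha * P[i, mu[i]] @ J for i in range(2)])

def T(J):
    """(TJ)(i) = min over u of (T_u J)(i)."""
    return np.array([min(g[i, u] + alpha * P[i, u] @ J for u in range(2))
                     for i in range(2)])

# Fixed point of T_mu: solve the linear system J = g_mu + alpha * P_mu J.
mu = np.array([1, 1])
P_mu = np.array([P[i, mu[i]] for i in range(2)])
g_mu = np.array([g[i, mu[i]] for i in range(2)])
J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)

# Fixed point of T: value iteration (a sup-norm contraction of modulus alpha).
J_star = np.zeros(2)
for _ in range(1000):
    J_star = T(J_star)

print(J_mu, J_star)   # the two fixed points differ: iterating with T_mu aims at J_mu, not J*
```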
Addressing the Difficulty: A High-Level View
Assume that each Tµ is a sup-norm contraction
Our approach: Embed both T and Tµ within another (uniform) contraction mapping Gµ that has the same fixed point for all µ
The uniform contraction mapping Gµ operates on a larger space
In the DP context, Gµ is associated with an optimal stopping problem
Most of what follows applies beyond DP
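One concrete way such a uniform contraction can arise, sketched here in the spirit of the optimal-stopping construction (the exact operator used in the talk may differ in details), is a mapping on Q-factors: for a fixed vector J and parameter ν, (F Q)(i, u) = ḡ(i, u) + α ∑_j p_ij(u) min{J(j), Q(j, ν(j))} -- at the next state j one may "stop" and accept J(j), or "continue" with Q(j, ν(j)). Since |min{c, a} − min{c, b}| ≤ |a − b|, this mapping is a sup-norm contraction of modulus α in Q, uniformly in ν; the Python sketch below (invented MDP data) checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 3, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(n, m))   # p_ij(u), invented data
g = rng.random((n, m))                       # expected cost per stage g(i, u)

def F(Q, J, nu):
    """Stopping-problem mapping on Q-factors: at the next state j, either
    'stop' and take J(j) or 'continue' with Q(j, nu(j))."""
    cont = np.minimum(J, Q[np.arange(n), nu])    # min{J(j), Q(j, nu(j))}
    return g + alpha * (P @ cont)                # g(i,u) + alpha sum_j p_ij(u) min{...}

# Uniform sup-norm contraction: for every nu, ||F(Q1) - F(Q2)|| <= alpha ||Q1 - Q2||,
# because |min{c, a} - min{c, b}| <= |a - b| componentwise.
J = rng.random(n)
for _ in range(100):
    nu = rng.integers(m, size=n)
    Q1, Q2 = rng.random((n, m)), rng.random((n, m))
    lhs = np.max(np.abs(F(Q1, J, nu) - F(Q2, J, nu)))
    assert lhs <= alpha * np.max(np.abs(Q1 - Q2)) + 1e-12
print("uniform contraction check passed")
```

The check exercises many random ν with the same J: the contraction modulus α never depends on ν, which is the "uniformity" the slide refers to.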
Outline
1 Asynchronous Computation of Fixed Points
2 Value and Policy Iteration for Discounted MDP
3 New Asynchronous Policy Iteration
4 Generalizations
Dynamic Programming - Markovian Decision Problems (MDP)
System: Controlled Markov chain w/ transition probabilities pij (u)
States: i = 1, . . . , n
Controls: u ∈ U(i)
Cost per stage: g(i, u, j)
Stationary policy: State to control mapping µ; apply µ(i) when at state i
Discounted MDP: Find a policy µ that minimizes the expected value of the infinite horizon cost

∑_{k=0}^{∞} α^k g(i_k, µ(i_k), i_{k+1})

where i_k = state at time k, i_{k+1} = state at time k + 1, and α = discount factor, 0 < α < 1
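To make the cost criterion concrete, here is a small Python check (with invented transition and cost data): a Monte Carlo estimate of ∑_k α^k g(i_k, µ(i_k), i_{k+1}) under a fixed policy agrees with the exact expected cost obtained by solving the policy's linear equation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 3-state chain under a fixed policy mu (all numbers invented).
n, alpha = 3, 0.9
P_mu = rng.dirichlet(np.ones(n), size=n)      # p_ij(mu(i))
g_mu = rng.random((n, n))                     # g(i, mu(i), j)

def simulate_cost(i0, horizon=200):
    """One Monte Carlo sample of sum_k alpha^k g(i_k, mu(i_k), i_{k+1}),
    truncated at a horizon where alpha^k is negligible."""
    i, total = i0, 0.0
    for k in range(horizon):
        j = rng.choice(n, p=P_mu[i])
        total += alpha**k * g_mu[i, j]
        i = j
    return total

# Exact expected cost: J_mu solves J = gbar + alpha * P_mu J,
# where gbar(i) = sum_j p_ij(mu(i)) g(i, mu(i), j).
gbar = np.sum(P_mu * g_mu, axis=1)
J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, gbar)

est = np.mean([simulate_cost(0) for _ in range(2000)])
print(est, J_mu[0])   # the Monte Carlo estimate is close to the exact expected cost
```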
Major DP Results
J*(i) = Optimal cost starting from state i
Jµ(i) = Cost starting from state i using policy µ
Bellman's equation:

J*(i) = min_{u∈U(i)} ∑_{j=1}^n p_ij(u) (g(i, u, j) + αJ*(j)),  i = 1, …, n

A parametric fixed point equation; J* is the unique fixed point.
An optimal policy attains the minimum in the RHS of Bellman's equation for each i.
Bellman's equation for a policy µ:

Jµ(i) = ∑_{j=1}^n p_ij(µ(i)) (g(i, µ(i), j) + αJµ(j)),  i = 1, …, n

It is a linear system of equations with Jµ as its unique solution.
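Since Bellman's equation for a fixed policy is linear, Jµ can be computed exactly with one linear solve. A minimal sketch on a hypothetical 2-state, 2-control discounted MDP (the transition and cost data below are illustrative, not from the talk):

```python
import numpy as np

alpha = 0.9                                  # discount factor, 0 < alpha < 1
# P[i, u, j] = p_ij(u), G[i, u, j] = g(i, u, j)  (illustrative data)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

def evaluate(mu):
    """Solve J_mu = g_mu + alpha * P_mu J_mu, a linear system in J_mu."""
    n = P.shape[0]
    Pmu = P[np.arange(n), mu]                              # rows p_ij(mu(i))
    gmu = np.einsum('ij,ij->i', Pmu, G[np.arange(n), mu])  # expected stage cost
    return np.linalg.solve(np.eye(n) - alpha * Pmu, gmu)

mu = [0, 1]
J = evaluate(mu)
# J satisfies Bellman's equation for mu: J(i) = sum_j p_ij(mu(i))(g(i,mu(i),j) + alpha J(j))
residual = np.einsum('ij,ij->i', P[np.arange(2), mu],
                     G[np.arange(2), mu] + alpha * J) - J
```

The residual of Bellman's equation for µ is zero up to roundoff, confirming Jµ is the unique solution of the linear system.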
Shorthand Notation - Fixed Point View
Denote by T and Tµ the mappings that map J ∈ ℜ^n to the vectors TJ and TµJ with components

(TJ)(i) := min_{u∈U(i)} ∑_{j=1}^n p_ij(u) (g(i, u, j) + αJ(j)),  i = 1, …, n,

and

(TµJ)(i) := ∑_{j=1}^n p_ij(µ(i)) (g(i, µ(i), j) + αJ(j)),  i = 1, …, n

Bellman's equations are written as

J* = TJ* = min_µ TµJ*,  Jµ = TµJµ

Key structure: T and Tµ are sup-norm contractions with modulus α,

‖TJ − TJ′‖∞ = max_{i=1,…,n} |(TJ)(i) − (TJ′)(i)| ≤ α max_{i=1,…,n} |J(i) − J′(i)| = α‖J − J′‖∞
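The contraction inequality above can be checked numerically: for random pairs J, J′, the sup-norm distance never grows by more than the factor α under T. A small sketch on illustrative MDP data (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
# P[i, u, j] = p_ij(u), G[i, u, j] = g(i, u, j)  (illustrative data)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

def T(J):
    """(TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha * J(j))."""
    return np.einsum('iuj,iuj->iu', P, G + alpha * J).min(axis=1)

# Check ||TJ - TJ'||_inf <= alpha * ||J - J'||_inf on 1000 random pairs
ok = all(
    np.max(np.abs(T(J) - T(Jp))) <= alpha * np.max(np.abs(J - Jp)) + 1e-12
    for J, Jp in (rng.normal(size=(2, 2)) * 10 for _ in range(1000))
)
```

The inequality holds for every sampled pair, as the contraction property guarantees.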
Major (Synchronous) Methods for Finding Fixed Point of T
Value iteration (generic fixed point method): Start with any J0 and iterate by

J^{t+1} = TJ^t

Policy iteration (special method for T of the form T = min_µ Tµ): Start with any J0 and µ0. Given J^t and µ^t, iterate by:
Policy evaluation: J^{t+1} = (T_{µ^t})^m J^t (m applications of T_{µ^t} to J^t; m = ∞ is possible)
Policy improvement: µ^{t+1} attains the min in TJ^{t+1} (i.e., T_{µ^{t+1}} J^{t+1} = TJ^{t+1})
Both methods converge to J*:
Value iteration, thanks to the contraction property of T
Policy iteration, thanks to the contraction and monotonicity of T and Tµ
Typically, policy iteration (with a reasonable choice of m) is more efficient, because applying Tµ is cheaper than applying T
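Both methods are easy to sketch on the same hypothetical 2-state MDP (illustrative data, not from the talk): value iteration applies T directly, while optimistic policy iteration alternates m sweeps of Tµ with a policy improvement, and both reach the same fixed point.

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])
n = P.shape[0]

def bellman(J):
    Q = np.einsum('iuj,iuj->iu', P, G + alpha * J)  # Q(i,u)
    return Q.min(axis=1), Q.argmin(axis=1)          # TJ and a minimizing policy

# Value iteration: J^{t+1} = T J^t
J_vi = np.zeros(n)
for _ in range(400):
    J_vi, _ = bellman(J_vi)

# Optimistic policy iteration: m = 5 evaluation sweeps per improvement
J_pi = np.zeros(n)
for _ in range(80):
    _, mu = bellman(J_pi)                           # policy improvement
    for _ in range(5):                              # m applications of T_mu
        J_pi = np.einsum('ij,ij->i', P[np.arange(n), mu],
                         G[np.arange(n), mu] + alpha * J_pi)
```

Both iterates agree with each other and satisfy J = TJ to numerical precision.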
Graphical Interpretation of Policy Iteration
[Figure: the maps TJ = min_µ TµJ and Tµ1J plotted against the 45-degree line. Value iteration generates J^{t+1} = TJ^t, converging to the fixed point J* = TJ*. Policy iteration alternates policy improvement (J0 ↦ (J1, µ1)) with exact or approximate policy evaluation (e.g., J2 = T²_{µ1} J1, exact evaluation terminating at Jµ1 = Tµ1 Jµ1), each cycle moving toward the fixed point.]
Failure of Asynchronous Policy Iteration: W-B Example
Counterexample by Williams and Baird (1993)
Deterministic discounted MDP with 6 states arranged in a circle
2 controls available in half the states, 1 control available in the other half
Policy evaluations and improvements are one state at a time, with no "delays"
A cycle of 15 iterations is constructed that repeats the initial conditions
Outline
1 Asynchronous Computation of Fixed Points
2 Value and Policy Iteration for Discounted MDP
3 New Asynchronous Policy Iteration
4 Generalizations
Rectifying the Difficulty - Q-Factors
Q-factors Q(i, u) are functions of state-control pairs (i, u)
The optimal Q-factors are given by

Q*(i, u) = ∑_{j=1}^n p_ij(u) (g(i, u, j) + αJ*(j))

Q*(i, u): Cost of starting at i, using u first, then using an optimal policy.
Bellman's equation J* = TJ* is written as

J*(i) = min_{u∈U(i)} Q*(i, u)

These two equations constitute a fixed point equation in (Q, J)
For a fixed policy µ, the corresponding fixed point equation in (Qµ, Jµ) is

Qµ(i, u) = ∑_{j=1}^n p_ij(u) (g(i, u, j) + αQµ(j, µ(j))),  Jµ(i) = Qµ(i, µ(i))
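Since the two equations above define Q* as a fixed point, Q* can be computed by iterating them jointly (Q-value iteration). A sketch on the same illustrative data (hypothetical, not from the talk):

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

Q = np.zeros((2, 2))
for _ in range(400):
    J = Q.min(axis=1)                               # J(j) = min_u Q(j, u)
    Q = np.einsum('iuj,iuj->iu', P, G + alpha * J)  # Q(i,u) = sum_j p_ij(u)(g + alpha J(j))

J_star = Q.min(axis=1)
# J_star should satisfy Bellman's equation J* = TJ*
residual = np.einsum('iuj,iuj->iu', P, G + alpha * J_star).min(axis=1) - J_star
```

Taking J(i) = min_u Q(i, u) inside the iteration is exactly the substitution that turns the pair of equations into a single fixed point iteration in Q.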
Sup-Norm Uniform Contraction Property
Consider Q-factors Q(i, u) and costs J(i). For any µ, define the mapping

(Q, J) ↦ (Fµ(Q, J), Mµ(Q, J))

where

Fµ(Q, J)(i, u) := ∑_{j=1}^n p_ij(u) (g(i, u, j) + α min{J(j), Q(j, µ(j))}),
Mµ(Q, J)(i) := min_{u∈U(i)} Fµ(Q, J)(i, u)

Key fact: A sup-norm contraction with common fixed point (Q*, J*) for all µ:

max{‖Fµ(Q, J) − Q*‖∞, ‖Mµ(Q, J) − J*‖∞} ≤ α max{‖Q − Q*‖∞, ‖J − J*‖∞}

The asynchronous iteration for the fixed point problem

Q := Fµ(Q, J),  J := Mµ(Q, J),  µ := arbitrary

is convergent.
Even though we operate with different mappings Fµ and Mµ, they all have a common fixed point.
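The uniform contraction can also be checked numerically: compute (Q*, J*) once by Q-value iteration, then verify the key inequality for random µ, Q, and J. A sketch with the same illustrative data (hypothetical, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])

def F(Q, J, mu):
    """F_mu(Q,J)(i,u) = sum_j p_ij(u)(g(i,u,j) + alpha*min{J(j), Q(j,mu(j))})."""
    W = np.minimum(J, Q[np.arange(2), mu])
    return np.einsum('iuj,iuj->iu', P, G + alpha * W)

def M(Q, J, mu):
    return F(Q, J, mu).min(axis=1)

# Common fixed point (Q*, J*) via Q-value iteration
Qs = np.zeros((2, 2))
for _ in range(400):
    Qs = np.einsum('iuj,iuj->iu', P, G + alpha * Qs.min(axis=1))
Js = Qs.min(axis=1)

ok = True
for _ in range(500):
    mu = rng.integers(0, 2, 2)                      # arbitrary policy
    Q = Qs + rng.normal(size=(2, 2)) * 5
    J = Js + rng.normal(size=2) * 5
    lhs = max(np.abs(F(Q, J, mu) - Qs).max(), np.abs(M(Q, J, mu) - Js).max())
    rhs = alpha * max(np.abs(Q - Qs).max(), np.abs(J - Js).max())
    ok &= lhs <= rhs + 1e-10
```

Note that min{J*(j), Q*(j, µ(j))} = J*(j) for every µ, which is why (Q*, J*) is a fixed point of Fµ and Mµ for all µ simultaneously.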
Asynchronous Policy Iteration: A More Detailed View
Asynchronous distributed policy iteration algorithm: maintains J^t, µ^t, and V^t, where

V^t(i) = Q^t(i, µ^t(i)) ≈ Q-factors of the current policy

Processor i at time t does one of three things:
Does nothing
Or operates with F_{µ^t} at i:

V^{t+1}(i) = ∑_{j=1}^n p_ij(µ^t(i)) (g(i, µ^t(i), j) + α min{J^t(j), V^t(j)})

and leaves J^t(i), µ^t(i) unchanged.
Or operates with M_{µ^t} at i: sets

J^{t+1}(i) = V^{t+1}(i) = min_{u∈U(i)} ∑_{j=1}^n p_ij(u) (g(i, u, j) + α min{J^t(j), V^t(j)})

and sets µ^{t+1}(i) to a control that attains the minimum.
Convergence follows from the asynchronous convergence theorem.
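A minimal single-thread simulation of this algorithm: at each step a processor (state) and an update type are chosen at random, mimicking asynchronous scheduling (illustrative data and hypothetical scheduling, not from the talk).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [1.0, 1.0]]])
n, m = 2, 2

J = rng.uniform(-5.0, 5.0, n)
V = rng.uniform(-5.0, 5.0, n)
mu = rng.integers(0, m, n)

for _ in range(20000):
    i = int(rng.integers(n))
    W = np.minimum(J, V)                            # min{J^t, V^t}
    if rng.random() < 0.5:                          # policy evaluation at i (F_mu)
        V[i] = P[i, mu[i]] @ (G[i, mu[i]] + alpha * W)
    else:                                           # policy improvement at i (M_mu)
        q = np.array([P[i, u] @ (G[i, u] + alpha * W) for u in range(m)])
        mu[i] = int(q.argmin())
        J[i] = V[i] = q.min()

# Compare with J* from synchronous value iteration
J_star = np.zeros(n)
for _ in range(400):
    J_star = np.einsum('iuj,iuj->iu', P, G + alpha * J_star).min(axis=1)
```

Despite the arbitrary interleaving of evaluations and improvements, J converges to J*, as the uniform sup-norm contraction guarantees.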
Williams and Baird Counterexample
Generalized Mappings T and Tµ

The preceding analysis uses only the contraction property of the discounted MDP (not monotonicity or the probabilistic structure).

Abstract mappings T and Tµ: introduce a mapping H(i, u, J) and denote

    (TJ)(i) = min_{u∈U(i)} H(i, u, J),    (TµJ)(i) = H(i, µ(i), J),

i.e., TJ = min_µ TµJ, where the min is taken separately for each component.

Assume that for all i and u ∈ U(i),

    |H(i, u, J) − H(i, u, J′)| ≤ α ‖J − J′‖∞

Asynchronous algorithm: at time t, processor i does one of three things:

Does nothing.

Or does a "policy evaluation" at i: sets

    V^{t+1}(i) = H(i, µ^t(i), min{J^t, V^t})

and leaves J^t(i), µ^t(i) unchanged.

Or does a "policy improvement" at i: sets

    J^{t+1}(i) = V^{t+1}(i) = min_{u∈U(i)} H(i, u, min{J^t, V^t})

and sets µ^{t+1}(i) to a u that attains the minimum.
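To illustrate the generality, here is a hedged sketch that runs the same asynchronous scheme with an abstract H; the particular choice H(i, u, J) = max_j [ g(i, u, j) + αJ(j) ] is a sequential minimax mapping (the adversary picks the worst successor j), and all problem data are hypothetical. This H satisfies the sup-norm contraction assumption above, since a max of terms shifted by αJ(j) moves by at most α‖J − J′‖∞.

```python
# Hedged sketch: asynchronous policy iteration with an abstract H,
# here a minimax H(i,u,J) = max_j (g(i,u,j) + alpha*J(j)).
# All data below are illustrative, not from the talk.
import numpy as np

n, alpha = 4, 0.8
rng = np.random.default_rng(1)
U = [list(range(2)) for _ in range(n)]     # two controls per state
g = rng.uniform(0.0, 1.0, size=(n, 2, n))  # g[i, u, j]

def H(i, u, J):
    # max over the adversary's successor choice: an alpha-contraction in J
    return float(np.max(g[i, u] + alpha * J))

def T(J):
    return np.array([min(H(i, u, J) for u in U[i]) for i in range(n)])

# Reference fixed point J* = TJ* by successive approximation
Jstar = np.zeros(n)
for _ in range(500):
    Jstar = T(Jstar)

# Asynchronous algorithm with the min{J, V} safeguard
J = np.zeros(n)
V = np.zeros(n)
mu = [0] * n
for t in range(30000):
    i = int(rng.integers(n))
    W = np.minimum(J, V)
    if rng.random() < 0.5:                 # "policy evaluation" at i
        V[i] = H(i, mu[i], W)
    else:                                  # "policy improvement" at i
        vals = [H(i, u, W) for u in U[i]]
        mu[i] = int(np.argmin(vals))
        J[i] = V[i] = vals[mu[i]]

print(float(np.max(np.abs(J - Jstar))))    # sup-norm error vs. the fixed point
```

Only the contraction assumption on H is used; swapping in a different H (expectation, max, or a mixture) leaves the driver loop unchanged.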
DP Applications with Generalized T and Tµ

DP models beyond discounted, with standard policy evaluation:

Optimistic/modified policy iteration for semi-Markov and minimax discounted problems
Stochastic shortest path problems
Semi-Markov decision problems
Stochastic zero-sum games
Sequential minimax problems

Multi-agent aggregation:

Each agent updates costs at all states within a subset
Each agent uses detailed costs for the local states and aggregate costs for other states, as communicated by other agents
Fixed Points of Concave Sup-Norm Contractions
[Figure: a concave mapping TJ, shown as the minimum of linear mappings TµJ, together with the linearization process: starting from J0, iterating the current linearization gives J1 = TµJ0 and leads toward Jµ; successive µ-updates (relinearizations) lead to the fixed point J* = TJ*.]
Find a fixed point of a mapping T : ℝⁿ → ℝⁿ whose components Ti(·) : ℝⁿ → ℝ are concave sup-norm contractions.

T is the minimum of a collection of linear mappings Tµ, obtained from the concave conjugate functions of the concave functions Ti.

An asynchronous parametric fixed point algorithm can be used:

Linearize at the current J
Iterate using the linearized mapping
Update the linearization

All of the above are done asynchronously.
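The steps above can be sketched numerically under illustrative assumptions: take each component of T to be the pointwise minimum of a few affine sup-norm contractions (hence concave), and alternate a few iterations of the current linearization with a relinearization step. The matrices and vectors below are hypothetical, scaled so that each linear mapping has sup-norm modulus α; this is a synchronous sketch of the parametric idea, not the talk's asynchronous implementation.

```python
# Hedged sketch: T is the pointwise minimum of affine contractions
# T_mu(J) = A[mu] @ J + b[mu]; the parametric algorithm iterates the
# current linearization and periodically relinearizes.
import numpy as np

n, m, alpha = 5, 3, 0.9
rng = np.random.default_rng(2)
A = rng.uniform(0.0, 1.0, size=(m, n, n))
A *= alpha / A.sum(axis=2, keepdims=True)   # each row sums to alpha: ||A_mu||_inf = alpha
b = rng.uniform(0.0, 1.0, size=(m, n))

def T(J):
    # concave, since it is a pointwise min over a linear family
    return np.min(A @ J + b, axis=0)

# Reference fixed point J* = TJ* by successive approximation
Jstar = np.zeros(n)
for _ in range(1000):
    Jstar = T(Jstar)

# Parametric algorithm: linearize (pick the minimizing mu per component),
# iterate the linearized mapping a few times, then relinearize
J = np.zeros(n)
for _ in range(100):
    mu = np.argmin(A @ J + b, axis=0)       # linearization at the current J
    Amu = A[mu, np.arange(n), :]            # row i taken from mapping mu[i]
    bmu = b[mu, np.arange(n)]
    for _ in range(5):                      # iterate the linearized mapping
        J = Amu @ J + bmu

print(float(np.max(np.abs(J - Jstar))))     # sup-norm error vs. the fixed point
```

Because A is nonnegative, each Tµ is also monotone here, which is what lets this synchronous optimistic scheme converge; the talk's point is that the min{J, V} device restores convergence even when the updates are interleaved asynchronously.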
Concluding Remarks

Asynchronous policy iteration has fragile convergence properties.

We have provided a new approach that corrects the difficulties.

Key idea: embed the problem into one that involves a uniform sup-norm contraction.

Extensions to generalized DP models involving:

Other DP models (e.g., stochastic shortest path problems, games)
Fixed point problems for non-DP mappings of the form T = min_{µ∈M} Tµ, such as concave sup-norm contractions
Thank You!