
Pathologies of Approximate Policy Iteration in Dynamic Programming

Dimitri P. Bertsekas

Laboratory for Information and Decision Systems, Massachusetts Institute of Technology

March 2011


Summary

We consider policy iteration with cost function approximation

Used widely but exhibits very complex behavior and a variety of potential pathologies

Case of the tetris test problem

Two types of pathologies:
Deterministic: due to cost function approximation
Stochastic: due to simulation errors/noise

We survey the pathologies in:
Policy evaluation: due to errors in approximate evaluation of policies
Policy improvement: due to the policy improvement mechanism

Special focus: Policy oscillations and local attractors

Causes of the problem in TD/projected equation methods:
The projection operator may not be monotone
The projection norm may depend on the policy evaluated

We discuss methods that address the difficulty


References

D. P. Bertsekas, “Pathologies of Temporal Difference Methods in Approximate Dynamic Programming," Proc. 2010 IEEE Conference on Decision and Control, Atlanta, GA.

D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 2007, Supplementary Chapter on Approximate DP: on-line; a “living chapter."


MDP: Brief Review

J∗(i) = Optimal cost starting from state i

Jµ(i) = Cost starting from state i using policy µ

Denote by T and Tµ the DP mappings that transform J ∈ ℜ^n into the vectors TJ and TµJ with components

(TJ)(i) := min_{u∈U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αJ(j) ),   i = 1, . . . , n,

(TµJ)(i) := Σ_{j=1}^{n} p_ij(µ(i)) ( g(i, µ(i), j) + αJ(j) ),   i = 1, . . . , n

α < 1 for a discounted problem; α = 1 and a 0-cost termination state for a stochastic shortest path problem

Bellman’s equations have a unique solution:

J∗ = TJ∗, Jµ = TµJµ

µ∗ is optimal (i.e., J∗ = Jµ∗ ) iff Tµ∗J∗ = TJ∗
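To make the operators concrete, here is a minimal numerical sketch of T and Tµ for a tabular discounted MDP (not from the slides; it assumes every action is available at every state, and the array names P, g, alpha, mu are illustrative):

```python
import numpy as np

def bellman_T(J, P, g, alpha):
    """Optimality operator: (TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j)).
    P[u, i, j] = p_ij(u), g[u, i, j] = g(i, u, j); assumes U(i) is the same for all i."""
    Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
    return Q.min(axis=0)

def bellman_T_mu(J, P, g, alpha, mu):
    """Fixed-policy operator: (TµJ)(i) = sum_j p_ij(mu(i)) (g(i,mu(i),j) + alpha J(j)).
    mu is an integer array with mu[i] the action chosen at state i."""
    n = len(J)
    P_mu = P[mu, np.arange(n), :]                                # row i is p_i.(mu(i))
    g_mu = np.einsum('ij,ij->i', P_mu, g[mu, np.arange(n), :])   # expected one-stage cost
    return g_mu + alpha * P_mu @ J
```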


Policy Iteration: Lookup Table Representation

Policy iteration (exact): Start with any µ

Evaluation of policy µ: Find Jµ

Jµ = TµJµ

A linear equation

Improvement of policy µ: Find µ̄ that attains the min in TJµ, i.e.,

Tµ̄Jµ = TJµ

Policy iteration converges finitely (if exact)
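As a sketch of the exact method just described (again illustrative, reusing the arrays of the previous block): evaluation solves the linear equation Jµ = TµJµ, improvement takes the greedy policy, and the loop terminates when TµJµ = TJµ.

```python
def exact_policy_iteration(P, g, alpha, mu0):
    """Exact (lookup-table) policy iteration: terminates finitely at an optimal policy."""
    n = P.shape[1]
    mu = np.array(mu0)
    while True:
        # Policy evaluation: solve J_mu = g_mu + alpha P_mu J_mu
        P_mu = P[mu, np.arange(n), :]
        g_mu = np.einsum('ij,ij->i', P_mu, g[mu, np.arange(n), :])
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: greedy policy with respect to J_mu
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
        mu_bar = Q.argmin(axis=0)
        if np.array_equal(mu_bar, mu):     # T_mu J_mu = T J_mu: terminate
            return mu, J
        mu = mu_bar
```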


Illustration of Convergence

(Figure: in the space of cost vectors J, the iterates Jµ0, Jµ1, Jµ2, Jµ3 converge monotonically to J∗ = Jµ∗.)

With exact policy evaluation, convergence is finite and monotonic


Policy Iteration: Cost Function Approximation

An old, time-tested approach for solving large-scale equation problems
Approximation within the subspace S = {Φr | r ∈ ℜ^s}

J ≈ Φr , Φ is a matrix with basis functions/features as columns

(Figure: the subspace S = {Φr | r ∈ ℜ^s} within the space of cost vectors, with Jµ approximated by J̃µ = Φrµ.)

Instead of Jµ, find J̃µ = Φrµ ∈ S by some form of “projection" onto S

J̃µ = WTµ(J̃µ), or equivalently Φrµ = WTµ(Φrµ)

Example: A projected equation/Galerkin method: W = Π (a Euclidean projection)
Example: An aggregation method: W = ΦD, where Φ (aggregation matrix) and D (disaggregation matrix) have probability distributions as rows
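A minimal sketch of this evaluation step for a given n×n operator W with range in S (illustrative names; in the projection case W = Π, the weight vector ξ is an assumption here, e.g. the steady-state distribution of the policy):

```python
def evaluate_with_W(W, P_mu, g_mu, alpha, Phi):
    """Solve x = W T_mu(x), i.e. (I - alpha W P_mu) x = W g_mu, for x = Phi r in S."""
    n = len(g_mu)
    x = np.linalg.solve(np.eye(n) - alpha * W @ P_mu, W @ g_mu)
    r = np.linalg.lstsq(Phi, x, rcond=None)[0]   # weights r with Phi r = x
    return r, x

def projection_Pi(Phi, xi):
    """Weighted Euclidean projection Pi = Phi (Phi' Xi Phi)^{-1} Phi' Xi onto S,
    with Xi = diag(xi); used as W in the projected equation/Galerkin case."""
    Xi = np.diag(xi)
    return Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)
```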


Approximate Policy Iteration

Start with any µ

Evaluation of policy µ: Solve for J̃µ the linear equation

J̃µ = WTµ(J̃µ)

Improvement of policy µ: Find µ̄ that attains the min in T J̃µ, i.e.,

Tµ̄J̃µ = T J̃µ

Special twists that originated in Reinforcement Learning/ADP:

Policy evaluation can be done by simulation, with low-dimensional linear algebra
Matrix inversion method LSTD(λ), or iterative methods such as LSPE(λ), TD(λ), λ-policy iteration, etc.
Similar aggregation methods
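As an illustration of the simulation-based, low-dimensional linear algebra: a minimal LSTD(0) sketch that forms sample estimates Ĉ, d̂ from a single trajectory under µ and solves Ĉr = d̂ (the function and variable names are illustrative, not from the talk):

```python
def lstd0(states, costs, Phi, alpha):
    """LSTD(0): C_hat = average of phi(i_t)(phi(i_t) - alpha phi(i_{t+1}))',
    d_hat = average of phi(i_t) g_t, then r = C_hat^{-1} d_hat."""
    s = Phi.shape[1]
    C_hat, d_hat = np.zeros((s, s)), np.zeros(s)
    for t in range(len(costs)):                    # states has one more entry than costs
        phi_t, phi_next = Phi[states[t]], Phi[states[t + 1]]
        C_hat += np.outer(phi_t, phi_t - alpha * phi_next)
        d_hat += phi_t * costs[t]
    C_hat /= len(costs)
    d_hat /= len(costs)
    return np.linalg.solve(C_hat, d_hat)
```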


Tetris Case Study

Classical and challenging test problem with huge number of states

Initial policy iteration work (Van Roy MS Thesis, under J. Tsitsiklis, 1993): a 10x20 board, 3 basis functions, average score of ≈ 40 points

Most studies have used a 10x20 board, and a set of “standard" 22 basis functions introduced by Bertsekas and Ioffe (1996)

Approximate policy iteration [B+I (1996), Lagoudakis and Parr (2003)]

Policy gradient method [Kakade (2002)]

Approximate LP [Farias+VanRoy (2006), Desai+Farias+Moallemi (2009)]
All of the above achieved average scores in the range 3,000-6,000

BUT with a random search method, Szita and Lőrincz (2006) and Thiery and Scherrer (2009) achieved scores of 600,000-900,000


Potential Pathologies

General issue:
Good cost approximation =⇒ good performance of generated policies??
Bad cost approximation =⇒ bad performance of generated policies??
(Can add a constant to the cost of all states without affecting the next generated policy)

Policy evaluation issues (both can be quantified to some extent):
Bias
Simulation error/noise

(Figure: for a policy µ, the exact Jµ, its projection ΠJµ, and the solution of J̃µ = WTµ(J̃µ) within the subspace S, illustrating the bias and the simulation error.)

Policy iteration issues (hard to quantify and understand):

Oscillations of policies (local attractors; like local minima)
Exploration (simulation must ensure that all parts of the state space are adequately sampled/explored)


Policy Evaluation - Bias Issues - An Example


Stochastic shortest path problem with 0 as the termination state (from Bertsekas 1995; Neural Computation, Vol. 7). Consider a linear approximation of the form

J̃µ(i) = i r

(Plot: the cost function J(i) versus state i, 0 ≤ i ≤ 50, with vertical range −50 to 50, comparing the TD(1) approximation and the TD(0) approximation.)
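For reference, the two approximations plotted above can be computed as follows: TD(1) gives the weighted least-squares fit of the exact Jµ (the projection ΠJµ), while TD(0) gives the solution of the projected Bellman equation. A sketch for the discounted case (the slide's example is a stochastic shortest path problem, i.e. α = 1 with a termination state; the weight vector ξ is an assumption here):

```python
def td1_and_td0_weights(P_mu, g_mu, alpha, Phi, xi):
    """TD(1): Phi r = Pi J_mu (regression on the exact costs).
       TD(0): C r = d with C = Phi' Xi (I - alpha P_mu) Phi, d = Phi' Xi g_mu."""
    n = len(g_mu)
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)        # exact evaluation
    Xi = np.diag(xi)
    r_td1 = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ J_mu)  # projection of J_mu
    C = Phi.T @ Xi @ (np.eye(n) - alpha * P_mu) @ Phi             # projected equation
    d = Phi.T @ Xi @ g_mu
    r_td0 = np.linalg.solve(C, d)
    return r_td1, r_td0
```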


Policy Evaluation - Bias Issues - An Example


A strange twist: Introduce an ε-probability reverse decision at state n − 1
Policy iteration/TD(0) yields the optimal policy
Policy iteration/TD(1) does not


Policy Evaluation - Sensitivity to Simulation Noise

Consider the evaluation equation Φr = WTµ(Φr)

It is equivalent to a linear equation Cr = d with C a positive definite (nonsymmetric) matrix

In popular approaches, we compute by simulation estimates Ĉ ≈ C and d̂ ≈ d

The solution Φr̂ = ΦĈ−1d̂ may be highly sensitive to the simulation error


This necessitates lots of sampling ... confidence interval/convergence rate analysis needed (Konda Ph.D. Thesis, 2002)

Can happen even without subspace approximation, i.e., with a lookup table representation (S = ℜ^n)

Regularization methods may be used, but they introduce additional bias ... need to quantify
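One common form of the regularization mentioned above, shown only as an illustrative sketch (the parameter β and the prior weight vector are assumptions, not from the talk): replace the raw solve of Ĉr = d̂ by a ridge-type compromise that reduces noise sensitivity at the price of extra bias.

```python
def regularized_solve(C_hat, d_hat, beta, r_prior=None):
    """Solve (C_hat' C_hat + beta I) r = C_hat' d_hat + beta r_prior instead of C_hat r = d_hat."""
    s = C_hat.shape[1]
    if r_prior is None:
        r_prior = np.zeros(s)
    A = C_hat.T @ C_hat + beta * np.eye(s)
    b = C_hat.T @ d_hat + beta * r_prior
    return np.linalg.solve(A, b)
```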


Policy Improvement - Oscillations

Consider the space of weights r (policy µ is evaluated as J̃µ = Φrµ)

Rµ = set of r for which µ is greedy: Tµ(Φr) = T(Φr) (the greedy partition)

µ improves to µ̄ iff rµ ∈ Rµ̄

(Figure: the space of weights r partitioned into the greedy regions Rµ = {r | Tµ(Φr) = T(Φr)}; the iterates rµk, rµk+1, rµk+2, rµk+3 oscillate among the regions Rµk, Rµk+1, Rµk+2, Rµk+3.)

The algorithm ends up repeating a cycle of policies µk , µk+1, . . . , µk+m:

rµk ∈ Rµk+1 , rµk+1 ∈ Rµk+2 , . . . , rµk+m−1 ∈ Rµk+m , rµk+m ∈ Rµk

The greedy partition depends only on Φ - it is independent of the policy evaluation method used
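A small sketch of how such a cycle can be observed numerically (illustrative; `evaluate` stands for any approximate evaluation routine, e.g. the LSTD(0) or W-based solves sketched earlier): run approximate PI, record each greedy policy, and stop when a policy repeats.

```python
def greedy_policy(r, Phi, P, g, alpha):
    """mu(i) attaining the minimum in (T(Phi r))(i), i.e. the policy mu with r in R_mu."""
    J = Phi @ r
    Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
    return Q.argmin(axis=0)

def detect_policy_cycle(r0, Phi, P, g, alpha, evaluate, max_iters=200):
    """Approximate PI with r = evaluate(mu); returns (iteration, cycle length) on repetition."""
    seen = {}
    mu = greedy_policy(r0, Phi, P, g, alpha)
    for k in range(max_iters):
        key = tuple(int(a) for a in mu)
        if key in seen:
            return k, k - seen[key]        # policy cycle of this length detected
        seen[key] = k
        r = evaluate(mu)                   # approximate evaluation of the current policy
        mu = greedy_policy(r, Phi, P, g, alpha)
    return max_iters, None
```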


Back to Tetris

10x20 board, set of “standard" 22 basis functions

Approximate policy iteration [Bertsekas and Ioffe (1996), Lagoudakis and Parr (2003)]

Approximate LP [Farias+VanRoy (2006), Desai+Farias+Moallemi (2009)]

Policy gradient method [Kakade (2002)]

All of the above achieved average scores in the range 3,000-6,000

BUT with a random search method, Szita and Lőrincz (2006) and Thiery and Scherrer (2009) achieved scores of 600,000-900,000


What’s Going on in Tetris?

Exact optimal = ?

Approximate PI / Approximate LP / Policy gradient: average scores 3,000-6,000

Random search: 600,000-900,000

Based on tests with a smaller board: Oscillations occur often in “bad parts of the weight space". Not clear if oscillations are the problem

Random search and well-designed aggregation methods achieve a score very close to the exact optimal

The basis functions are very powerful (approx. optimal ≈ exact optimal)

Starting from an excellent weight vector, approximate policy iteration drifts off to cycle around a significantly inferior weight vector

Starting from a bad weight vector, approximate policy iteration drifts off to cycle around a better but not good weight vector


Search for Remedies

Consider again approximation within the subspace S = {Φr | r ∈ ℜ^s}

Problem with oscillations: The projection is not monotone (and also depends on µ)

Remedy: Replace the projection by a constant monotone operator W with range S

Policy evaluation using an approximate Bellman equation: Find J̃µ with

J̃µ = WTµ(J̃µ) instead of J̃µ = ΠTµ(J̃µ)

Policy iteration (approximate): Start with any µ

Evaluation of policy µ: Solve for J̃µ the equation

J̃µ = WTµ(J̃µ)

Improvement of policy µ: Find µ̄ that attains the min in T J̃µ, i.e.,

Tµ̄J̃µ = T J̃µ


Conditions for Convergence

Convergence Result: Assume the following:

(a) W is monotone: WJ ≤ WJ′ for any two J, J′ ∈ ℜ^n with J ≤ J′

(b) For each µ, WTµ is a contraction

(c) Termination when a policy µ is obtained such that Tµ J̃µ = T J̃µ

Then the method terminates in a finite number of iterations, and the cost vector obtained upon termination is a fixed point of WT.

Proof is similar to the classical proof of convergence of exact policy iteration

The contraction assumption can be weakened: For all J such that (WTµ)(J) ≤ J, we must have

J̃µ = lim_{k→∞} (WTµ)^k (J)

More general DP models can be accommodated.
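A rough numerical check of assumptions (a) and (b) for a candidate W, given only as a sketch (for a linear W, monotonicity is equivalent to nonnegative entries, and α‖WPµ‖∞ < 1 is a sufficient condition for WTµ to be a sup-norm contraction):

```python
def check_W(W, P_policies, alpha):
    """P_policies: list of transition matrices P_mu, one per policy of interest."""
    monotone = bool(np.all(W >= 0))                       # condition (a) for linear W
    contraction = all(alpha * np.abs(W @ P_mu).sum(axis=1).max() < 1.0
                      for P_mu in P_policies)             # sufficient check for (b)
    return monotone, contraction
```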


Convergence within the Approximation Subspace

(Figure: within the cost approximation subspace, the iterates Φrµ0, Φrµ1, Φrµ2, Φrµ3 converge to Φr∗.)

Convergence is finite and monotonic ... but how good is the limit?


Methods for Selecting W

Aggregation: W = ΦD, with the rows of Φ and D being probability distributions (this is a serious restriction)

Hard aggregation is an interesting special case: then W is also a projection
Another approach: No restriction on Φ (advantage when we have a desirable Φ)

“Double" the number of columns so that Φ ≥ 0 (separate the + and − parts of the columns)
Let W = ΦD. Choose W by some optimization criterion subject to D ≥ 0 and W (sup-norm) nonexpansive, i.e.,

φ(i)′ζ ≤ 1, ∀ states i,

where φ(i)′ is the ith row of Φ, and ζ is the vector of row sums of D.

A special possibility: Start with Φ ≥ 0, and use

W = γ ΦM−1Φ′Ξ,

where γ ≈ 1 and M is a (constant) positive definite diagonal replacement of Φ′ΞΦ in the projection formula

Π = Φ(Φ′ΞΦ)−1Φ′Ξ
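A minimal sketch of the hard-aggregation choice (illustrative; `groups[k]` lists the original states forming aggregate state k, and D averages uniformly within each group, though any distribution over the group would do):

```python
def hard_aggregation(groups, n):
    """Build Phi (0/1 membership), D (within-group distributions) and W = Phi D.
    W has nonnegative entries with unit row sums, hence is monotone and sup-norm
    nonexpansive; for hard aggregation it is also a projection (W W = W)."""
    s = len(groups)
    Phi = np.zeros((n, s))
    D = np.zeros((s, n))
    for k, members in enumerate(groups):
        Phi[members, k] = 1.0
        D[k, members] = 1.0 / len(members)
    return Phi, D, Phi @ D

# Example: six states aggregated into three groups of two
Phi, D, W = hard_aggregation([[0, 1], [2, 3], [4, 5]], n=6)
```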


Some Perspective

There are several pathologies in approximate PI ... How bad is that?

Other methods have pathologies, e.g., gradient methods that may be attracted to local minima.

This does not mean that they are not useful ...

... BUT in approximate PI the pathologies are many and diverse

... makes it hard to know what went wrong

Other approximate DP methods also have their own pathologies

Need better understanding of the pathologies, how to fix them, and how to detect them

What’s going on in Tetris?