Pathologies of Approximate Policy Iteration in Dynamic Programming
Dimitri P. Bertsekas
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
March 2011
Summary
We consider policy iteration with cost function approximation
Used widely, but exhibits very complex behavior and a variety of potential pathologies
Case of the Tetris test problem
Two types of pathologies:
  Deterministic: Due to cost function approximation
  Stochastic: Due to simulation errors/noise
We survey the pathologies in:
  Policy evaluation: Due to errors in approximate evaluation of policies
  Policy improvement: Due to the policy improvement mechanism
Special focus: Policy oscillations and local attractors
Causes of the problem in TD/projected equation methods:
  The projection operator may not be monotone
  The projection norm may depend on the policy evaluated
We discuss methods that address the difficulty
References
D. P. Bertsekas, “Pathologies of Temporal Difference Methods in Approximate Dynamic Programming," Proc. 2010 IEEE Conference on Decision and Control, Atlanta, GA.
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 2007, Supplementary Chapter on Approximate DP: on-line; a “living chapter."
MDP: Brief Review
J∗(i) = Optimal cost starting from state i
Jµ(i) = Cost starting from state i using policy µ
Denote by T and Tµ the DP mappings that transform J ∈ ℜ^n to the vectors TJ and TµJ with components

(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} pij(u) ( g(i, u, j) + αJ(j) ),   i = 1, . . . , n,

(TµJ)(i) = Σ_{j=1}^{n} pij(µ(i)) ( g(i, µ(i), j) + αJ(j) ),   i = 1, . . . , n
α < 1 for a discounted problem; α = 1 and a 0-cost termination state for a stochastic shortest path problem
Bellman’s equations have unique solution
J∗ = TJ∗, Jµ = TµJµ
µ∗ is optimal (i.e., J∗ = Jµ∗ ) iff Tµ∗J∗ = TJ∗
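A minimal numerical sketch of these mappings (the two-state transition probabilities and costs below are made-up illustrative data, not from the talk):

```python
import numpy as np

# Hypothetical 2-state discounted MDP with 2 controls per state (illustrative data only)
alpha = 0.9
# p[u][i, j] = transition probability from state i to state j under control u
p = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
# g[u][i, j] = cost of the transition (i -> j) under control u
g = {0: np.array([[1.0, 2.0], [0.0, 3.0]]),
     1: np.array([[2.0, 0.5], [1.0, 1.0]])}

def T(J):
    """Bellman mapping: (TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j))."""
    return np.min([np.sum(p[u] * (g[u] + alpha * J), axis=1) for u in p], axis=0)

def T_mu(J, mu):
    """Fixed-policy mapping: (T_mu J)(i) = sum_j p_ij(mu(i)) (g(i,mu(i),j) + alpha J(j))."""
    return np.array([np.sum(p[mu[i]][i] * (g[mu[i]][i] + alpha * J)) for i in range(len(J))])

J = np.zeros(2)
print("TJ    =", T(J))             # one value iteration step
print("T_muJ =", T_mu(J, [0, 1]))  # backup under an arbitrary policy mu
```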
Policy Iteration: Lookup Table Representation
Policy iteration (exact): Start with any µ
Evaluation of policy µ: Find Jµ
Jµ = TµJµ
A linear equation
Improvement of policy µ: Find a new policy µ̄ that attains the min in TJµ, i.e.,
Tµ̄Jµ = TJµ
Policy iteration converges finitely (if exact)
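A minimal lookup-table policy iteration sketch, under the same kind of illustrative data as above; the evaluation step solves the linear equation Jµ = TµJµ exactly:

```python
import numpy as np

alpha = 0.9
p = {0: np.array([[0.8, 0.2], [0.3, 0.7]]), 1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([[1.0, 2.0], [0.0, 3.0]]), 1: np.array([[2.0, 0.5], [1.0, 1.0]])}
n, controls = 2, [0, 1]

def evaluate(mu):
    """Exact policy evaluation: solve J = g_mu + alpha * P_mu J."""
    P_mu = np.array([p[mu[i]][i] for i in range(n)])
    g_mu = np.array([np.sum(p[mu[i]][i] * g[mu[i]][i]) for i in range(n)])
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def improve(J):
    """Greedy policy: attain the min in TJ at each state."""
    Q = np.array([np.sum(p[u] * (g[u] + alpha * J), axis=1) for u in controls])  # Q[u, i]
    return list(np.argmin(Q, axis=0))

mu = [0, 0]                      # arbitrary starting policy
while True:
    J_mu = evaluate(mu)
    mu_bar = improve(J_mu)       # policy improvement
    if mu_bar == mu:             # T_mu J_mu = T J_mu: finite termination at an optimal policy
        break
    mu = mu_bar
print("optimal policy:", mu, " J* =", J_mu)
```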
Illustration of Convergence
[Figure: in the space of cost vectors J, the sequence of policy costs Jµ0, Jµ1, Jµ2, Jµ3 decreases from J0 and reaches J∗ = Jµ∗ after finitely many iterations]
With exact policy evaluation, convergence is finite and monotonic
Policy Iteration: Cost Function Approximation
An old, time-tested approach for solving large-scale problems
Approximation within a subspace S = {Φr | r ∈ ℜ^s}
J ≈ Φr, where Φ is a matrix with basis functions/features as columns
[Figure: the policy cost Jµ and its approximation J̃µ = Φrµ within the subspace S = {Φr | r ∈ ℜ^s}]
Instead of Jµ, find J̃µ = Φr ∈ S by some form of “projection" onto S
J̃µ = WTµ(J̃µ), or equivalently Φrµ = WTµ(Φrµ)
Example: a projected equation/Galerkin method: W = Π (a Euclidean projection)
Example: an aggregation method: W = ΦD, where Φ (aggregation matrix) and D (disaggregation matrix) have probability distributions as rows
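For a fixed policy µ, both choices of W reduce to a low-dimensional linear equation in r. A minimal sketch, assuming illustrative problem data and basis matrices (the Φ, ξ, and D below are not from the talk):

```python
import numpy as np

alpha = 0.9
n, s = 4, 2
P_mu = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.1, 0.4, 0.5, 0.0],
                 [0.0, 0.2, 0.4, 0.4],
                 [0.3, 0.0, 0.0, 0.7]])      # transition matrix of the evaluated policy
g_mu = np.array([1.0, 0.5, 2.0, 0.0])        # expected one-stage costs under mu
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])  # basis functions as columns

# (1) Projected equation / Galerkin: Phi r = Pi T_mu(Phi r), Pi = projection with weights xi
xi = np.ones(n) / n                           # projection weights (e.g., a stationary distribution)
Xi = np.diag(xi)
C = Phi.T @ Xi @ (np.eye(n) - alpha * P_mu) @ Phi
d = Phi.T @ Xi @ g_mu
r_proj = np.linalg.solve(C, d)                # LSTD(0)-type solution

# (2) Aggregation: W = Phi D, with rows of Phi and D probability distributions
Phi_agg = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # hard aggregation
D = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])            # disaggregation probs
# Phi_agg r = Phi_agg D T_mu(Phi_agg r)  =>  r = D (g_mu + alpha P_mu Phi_agg r)
r_agg = np.linalg.solve(np.eye(s) - alpha * D @ P_mu @ Phi_agg, D @ g_mu)

print("projected-equation weights:", r_proj)
print("aggregation weights:       ", r_agg)
```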
Approximate Policy Iteration
Start with any µ
Evaluation of policy µ: Solve for J̃µ the linear equation
J̃µ = WTµ(J̃µ)
Improvement of policy µ: Find µ̄ that attains the min in TJ̃µ, i.e.,
Tµ̄J̃µ = TJ̃µ
Special twists that originated in Reinforcement Learning/ADP:
Policy evaluation can be done by simulation, with low-dimensional linear algebra
Matrix inversion method LSTD(λ), or iterative methods such as LSPE(λ), TD(λ), λ-policy iteration, etc.
Similar aggregation methods
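A minimal simulation-based sketch in the spirit of LSTD(0): estimate the low-dimensional matrix C and vector d from a sampled trajectory under µ, then solve for the weights (the problem data and features are again illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
n = 4
P_mu = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.1, 0.4, 0.5, 0.0],
                 [0.0, 0.2, 0.4, 0.4],
                 [0.3, 0.0, 0.0, 0.7]])
g_mu = np.array([1.0, 0.5, 2.0, 0.0])        # expected one-stage costs under mu
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])

# Simulate a long trajectory under mu and accumulate the LSTD(0) estimates
#   C_hat ~ E[ phi(i) (phi(i) - alpha phi(j))' ],   d_hat ~ E[ phi(i) g(i) ]
K = 50_000
i = 0
C_hat = np.zeros((2, 2))
d_hat = np.zeros(2)
for _ in range(K):
    j = rng.choice(n, p=P_mu[i])             # sample next state
    C_hat += np.outer(Phi[i], Phi[i] - alpha * Phi[j]) / K
    d_hat += Phi[i] * g_mu[i] / K
    i = j

r_hat = np.linalg.solve(C_hat, d_hat)        # simulation-based (noisy) estimate of r_mu
print("LSTD(0) weights from simulation:", r_hat)
```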
Tetris Case Study
Classical and challenging test problem with a huge number of states
Initial policy iteration work (Van Roy, MS Thesis under J. Tsitsiklis, 1993): a 10x20 board, 3 basis functions, average score of ≈ 40 points
Most studies have used a 10x20 board, and a set of “standard" 22 basis functions introduced by Bertsekas and Ioffe (1996)
Approximate policy iteration [B+I (1996), Lagoudakis and Parr (2003)]
Policy gradient method [Kakade (2002)]
Approximate LP [Farias+Van Roy (2006), Desai+Farias+Moallemi (2009)]
All of the above achieved average scores in the range 3,000-6,000
BUT with a random search method Szita and Lőrincz (2006), and Thiery and Scherrer (2009) achieved scores of 600,000-900,000
Potential Pathologies
General issue:
  Good cost approximation =⇒ good performance of generated policies??
  Bad cost approximation =⇒ bad performance of generated policies??
  (Can add a constant to the cost of all states without affecting the next generated policy)
Policy evaluation issues (both can be quantified to some extent):
  Bias
  Simulation error/noise
[Figure: relative to the subspace S, the exact policy cost Jµ, its projection ΠJµ, the solution of J̃µ = WTµ(J̃µ), the bias between ΠJµ and that solution, and the simulation error around the computed solution]
Policy iteration issues (hard to quantify and understand):
  Oscillations of policies (local attractors; like local minima)
  Exploration (simulation must ensure that all parts of the state space are adequately sampled/explored)
Policy Evaluation - Bias Issues - An Example
Stochastic shortest path problem with termination state 0 (from Bertsekas 1995, Neural Computation, Vol. 7)
Consider a linear approximation of the form
J̃µ(i) = i r
[Plot: the cost function Jµ(i) over states i = 0, . . . , 50 (values roughly between −50 and 50), together with its TD(1) and TD(0) linear approximations]
A strange twist: introduce an ε-probability reverse decision at state n − 1
  Policy iteration/TD(0) yields the optimal policy
  Policy iteration/TD(1) does not
Policy Evaluation - Sensitivity to Simulation Noise
Consider the evaluation equation Φr = WTµ(Φr)
It is equivalent to a linear equation Cr = d, with C a positive definite (nonsymmetric) matrix
In popular approaches, we compute by simulation estimates Ĉ ≈ C and d̂ ≈ d
The solution Φr̂ = ΦĈ⁻¹d̂ may be highly sensitive to simulation error (a numerical sketch is given below)
This necessitates lots of sampling ... confidence interval/convergence rate analysis needed (Konda, Ph.D. Thesis, 2002)
Can happen even without subspace approximation/lookup table representation (S = ℜ^n)
Regularization methods may be used, but they introduce additional bias ... need to quantify
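A small numerical sketch of this sensitivity (the near-singular matrix below is artificial, not derived from a particular MDP): when C is nearly singular, small estimation errors in Ĉ and d̂ produce large errors in r̂ = Ĉ⁻¹d̂.

```python
import numpy as np

rng = np.random.default_rng(1)

# Artificial near-singular "projected equation" data C r = d
C = np.array([[1.0, 0.999],
              [0.999, 1.0]])          # condition number ~ 2000
d = np.array([1.0, 1.0])
r_exact = np.linalg.solve(C, d)

# Simulate small estimation noise in C_hat, d_hat (as produced by finite sampling)
noise = 1e-3
C_hat = C + noise * rng.standard_normal(C.shape)
d_hat = d + noise * rng.standard_normal(d.shape)
r_hat = np.linalg.solve(C_hat, d_hat)

print("exact r:       ", r_exact)
print("noisy r_hat:   ", r_hat)
print("relative error:", np.linalg.norm(r_hat - r_exact) / np.linalg.norm(r_exact))
```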
Policy Improvement - Oscillations
Consider the space of weights r (policy µ is evaluated as J̃µ = Φrµ)
Rµ = set of r for which µ is greedy: Tµ(Φr) = T(Φr) (the greedy partition)
µ improves to µ̄ iff rµ ∈ Rµ̄
[Figure: the space of weights r partitioned into the greedy regions Rµ = {r | Tµ(Φr) = T(Φr)}; the iterates rµk, rµk+1, rµk+2, rµk+3 jump among the regions Rµk+1, Rµk+2, Rµk+3, . . . and eventually cycle]
The algorithm ends up repeating a cycle of policies µk , µk+1, . . . , µk+m:
rµk ∈ Rµk+1 , rµk+1 ∈ Rµk+2 , . . . , rµk+m−1 ∈ Rµk+m , rµk+m ∈ Rµk
The greedy partition depends only on Φ; it is independent of the policy evaluation method used
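A minimal sketch of how such a cycle can be detected: run approximate PI with a projected-equation (LSTD-type) evaluation, compute the greedy policy for each Φrµ, and stop when a previously generated policy reappears; all problem data and features below are illustrative assumptions (a repeating block of length one means the iteration has converged).

```python
import numpy as np

alpha = 0.9
n, controls = 4, [0, 1]
p = {0: np.array([[0.5, 0.5, 0.0, 0.0], [0.1, 0.4, 0.5, 0.0],
                  [0.0, 0.2, 0.4, 0.4], [0.3, 0.0, 0.0, 0.7]]),
     1: np.array([[0.0, 0.1, 0.9, 0.0], [0.6, 0.2, 0.1, 0.1],
                  [0.5, 0.0, 0.5, 0.0], [0.0, 0.0, 0.5, 0.5]])}
g = {0: np.array([1.0, 0.5, 2.0, 0.0]), 1: np.array([0.2, 1.5, 0.3, 2.5])}  # expected costs g(i, u)
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])
xi = np.ones(n) / n                                   # projection weights (kept fixed here)

def evaluate(mu):
    """Projected-equation (LSTD-type) evaluation of mu: returns the weight vector r_mu."""
    P_mu = np.array([p[mu[i]][i] for i in range(n)])
    g_mu = np.array([g[mu[i]][i] for i in range(n)])
    C = Phi.T @ np.diag(xi) @ (np.eye(n) - alpha * P_mu) @ Phi
    return np.linalg.solve(C, Phi.T @ np.diag(xi) @ g_mu)

def greedy(r):
    """A policy that is greedy with respect to Phi r, i.e., a mu with r in R_mu."""
    J = Phi @ r
    Q = np.array([g[u] + alpha * p[u] @ J for u in controls])   # Q[u, i]
    return tuple(int(u) for u in np.argmin(Q, axis=0))

mu, seen = (0, 0, 0, 0), []
while mu not in seen:                                 # finitely many policies => loop terminates
    seen.append(mu)
    mu = greedy(evaluate(mu))
cycle = seen[seen.index(mu):]                         # the repeating block of policies
print("repeating block of policies:", cycle)
```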
Back to Tetris
10x20 board, set of “standard" 22 basis functions
Approximate policy iteration [Bertsekas and Ioffe (1996), Lagoudakis and Parr (2003)]
Approximate LP [Farias+Van Roy (2006), Desai+Farias+Moallemi (2009)]
Policy gradient method [Kakade (2002)]
All of the above achieved average scores in the range 3,000-6,000
BUT with a random search method Szita and Lőrincz (2006), and Thiery and Scherrer (2009) achieved scores of 600,000-900,000
What’s Going on in Tetris?
  Exact optimal = ?
  Approximate PI, approximate LP, policy gradient: 3,000-6,000
  Random search: 600,000-900,000
Based on tests with a smaller board: oscillations occur often in “bad parts of the weight space". Not clear if oscillations are the problem
Random search and well-designed aggregation methods achieve a score very close to the exact optimal
The basis functions are very powerful (approx. optimal ≈ exact optimal)
Starting from an excellent weight vector, approximate policy iteration drifts off to cycle around a significantly inferior weight vector
Starting from a bad weight vector, approximate policy iteration drifts off to cycle around a better but not good weight vector
Search for Remedies
Consider again approximation within the subspace S = {Φr | r ∈ ℜ^s}
Problem with oscillations: the projection is not monotone (and also depends on µ)
Remedy: Replace the projection by a constant monotone operator W with range S
Policy evaluation using an approximate Bellman equation: Find J̃µ with
J̃µ = WTµ(J̃µ) instead of J̃µ = ΠTµ(J̃µ)
Policy iteration (approximate): Start with any µ
Evaluation of policy µ: Solve for J̃µ the equation
J̃µ = WTµ(J̃µ)
Improvement of policy µ: Find µ̄ that attains the min in TJ̃µ, i.e.,
Tµ̄J̃µ = TJ̃µ
Conditions for Convergence
Convergence Result: Assume the following:
(a) W is monotone: WJ ≤ WJ′ for any two J, J′ ∈ ℜ^n with J ≤ J′
(b) For each µ, WTµ is a contraction
(c) Termination when a policy µ is obtained such that TµJ̃µ = TJ̃µ
Then the method terminates in a finite number of iterations, and the cost vector obtained upon termination is a fixed point of WT.
Proof is similar to the classical proof of convergence of exact policy iteration
The contraction assumption can be weakened: for all J such that (WTµ)(J) ≤ J, we must have
J̃µ = lim_{k→∞} (WTµ)^k (J)
More general DP models can be accommodated.
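A minimal sketch of these conditions in a hard-aggregation case (illustrative data): W = ΦD has nonnegative entries and unit row sums, so it is monotone and sup-norm nonexpansive, WTµ is then an α-contraction, and the W-based policy iteration terminates finitely.

```python
import numpy as np

alpha = 0.9
n, controls = 4, [0, 1]
p = {0: np.array([[0.5, 0.5, 0.0, 0.0], [0.1, 0.4, 0.5, 0.0],
                  [0.0, 0.2, 0.4, 0.4], [0.3, 0.0, 0.0, 0.7]]),
     1: np.array([[0.0, 0.1, 0.9, 0.0], [0.6, 0.2, 0.1, 0.1],
                  [0.5, 0.0, 0.5, 0.0], [0.0, 0.0, 0.5, 0.5]])}
g = {0: np.array([1.0, 0.5, 2.0, 0.0]), 1: np.array([0.2, 1.5, 0.3, 2.5])}  # expected costs g(i, u)

# Hard aggregation: states {1, 2} -> aggregate state A, states {3, 4} -> aggregate state B
Phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
D = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
W = Phi @ D
assert np.all(W >= 0) and np.allclose(W.sum(axis=1), 1)   # monotone and sup-norm nonexpansive

def evaluate(mu):
    """Fixed point of W T_mu: solve (I - alpha W P_mu) J = W g_mu."""
    P_mu = np.array([p[mu[i]][i] for i in range(n)])
    g_mu = np.array([g[mu[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - alpha * W @ P_mu, W @ g_mu)

def greedy(J):
    Q = np.array([g[u] + alpha * p[u] @ J for u in controls])
    return tuple(int(u) for u in np.argmin(Q, axis=0))

mu = (0, 0, 0, 0)
while True:
    J_tilde = evaluate(mu)
    mu_bar = greedy(J_tilde)
    if mu_bar == mu:            # T_mu J_tilde = T J_tilde: terminate
        break
    mu = mu_bar
print("terminal policy:", mu, " fixed point of WT:", J_tilde)
```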
Convergence within the Approximation Subspace
[Figure: within the cost approximation subspace, the generated sequence Φrµ0, Φrµ1, Φrµ2, Φrµ3 converges to a limit Φr∗]
Convergence is finite and monotonic ... but how good is the limit?
Methods for Selecting W
Aggregation: W = ΦD with the rows of Φ and D being probability distributions (this is a serious restriction)
Hard aggregation is an interesting special case: Then W is also a projection
Another approach: No restriction on Φ (advantage when we have a desirable Φ)
  “Double" the number of columns so that Φ ≥ 0 (separate + and − parts of the columns)
  Let W = ΦD. Choose W by some optimization criterion subject to D ≥ 0 and W (sup-norm) nonexpansive, i.e.,
  φ(i)′ζ ≤ 1, for all states i,
  where φ(i)′ is the ith row of Φ, and ζ is the vector of row sums of D
A special possibility: Start with Φ ≥ 0, and use
W = γ ΦM⁻¹Φ′Ξ,
where γ ≈ 1 and M is a (constant) positive definite diagonal replacement of Φ′ΞΦ in the projection formula Π = Φ(Φ′ΞΦ)⁻¹Φ′Ξ
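A minimal sketch of the last construction, with arbitrary illustrative choices of Φ, ξ, γ, and of the diagonal M (the particular M below is chosen only so that the row-sum bound is easy to verify; it is an assumption of this sketch, not the talk's prescription):

```python
import numpy as np

Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])   # nonnegative basis matrix (illustrative)
xi = np.array([0.4, 0.3, 0.2, 0.1])                                 # projection weights (illustrative)
Xi = np.diag(xi)
s = Phi.shape[1]

# Exact weighted projection Pi = Phi (Phi' Xi Phi)^{-1} Phi' Xi:
# in general neither monotone nor sup-norm nonexpansive
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

# W = gamma * Phi M^{-1} Phi' Xi, with a positive definite diagonal M replacing Phi' Xi Phi.
# The diagonal below is one assumed choice that keeps every row sum of W at most gamma.
sigma = Phi.T @ xi                                   # sigma_k = sum_i Phi[i, k] * xi[i]
M = np.diag(s * Phi.max(axis=0) * sigma)
gamma = 0.95
W = gamma * Phi @ np.linalg.solve(M, Phi.T @ Xi)

print("Pi has negative entries (not monotone):", bool(np.any(Pi < 0)))
print("max |row sum| of Pi:", np.abs(Pi).sum(axis=1).max())         # may exceed 1
print("W is nonnegative (monotone):", bool(np.all(W >= 0)))
print("max |row sum| of W:", np.abs(W).sum(axis=1).max())           # <= 1: sup-norm nonexpansive
```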
Some Perspective
There are several pathologies in approximate PI ... How bad is that?
Other methods have pathologies, e.g., gradient methods that may be attracted to local minima.
This does not mean that they are not useful ...
... BUT in approximate PI the pathologies are many and diverse
... makes it hard to know what went wrong
Other approximate DP methods also have their own pathologies
Need better understanding of the pathologies, how to fix them, and how to detect them
What's going on in Tetris?