Improved Bounds for Policy Iteration in Markov Decision Problems
Shivaram Kalyanakrishnan
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
shivaram@cse.iitb.ac.in
November 2017
Collaborators: Neeldhara Misra, Aditya Gopalan, Utkarsh Mall, Ritish Goyal, Anchit Gupta
Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs 1 / 9
Sequential Decision Making

[Figure: the agent–environment loop. The agent senses the current state, thinks, and acts; the environment responds with a reward and the next state.]

Examples: elevator control
(https://img.tradeindia.com/fp/1/524/panoramic-elevators-564.jpg)
and the game of Go
(http://www.nature.com/polopoly_fs/7.33483.1453824868!/image/WEB_Go—1.jpg_gen/derivatives/landscape_630/WEB_Go—1.jpg).
Markov Decision Problems (MDPs)

[Figure: a three-state MDP on states s1, s2, s3 with two actions, RED and BLUE; each arc is labelled with its (transition probability, reward) pair, e.g. (0.5, 0), (1, 1), (0.5, 3), (0.25, −1), (0.75, −2). Here γ = 0.9.]

Elements of an MDP
States (S)
Actions (A)
Transition probabilities (T)
Rewards (R)
Discount factor (γ)

Behaviour is encoded as a Policy π, which maps states to actions.

What is a “good” policy? One that maximises expected long-term reward.

V^π is the Value Function of π. For s ∈ S,
V^π(s) = E_π[r0 + γ r1 + γ² r2 + · · · | start state = s].
Optimal Policies

V^π satisfies a recursive equation: V^π = R_π + γ T_π V^π, which gives
V^π = (I − γ T_π)^{−1} R_π.

π      V^π(s1)   V^π(s2)   V^π(s3)
RRR      4.45      6.55     10.82
RRB     −5.61     −5.75     −4.05
RBR      2.76      4.48      9.12
RBB      2.76      4.48      3.48
BRR     10.0       9.34     13.10
BRB     10.0       7.25     10.0
BBR     10.0      11.0      14.45   ← Optimal policy
BBB     10.0      11.0      10.0

Every MDP is guaranteed to have an optimal policy π⋆ such that
∀π ∈ Π, ∀s ∈ S : V^{π⋆}(s) ≥ V^π(s).

What is the complexity of computing an optimal policy?

Note: an MDP with |S| = n states and |A| = k actions has a total of k^n policies.

One extra definition is needed: the Action Value Function Q^π_a for a ∈ A:
Q^π_a = R_a + γ T_a V^π.

Given π, a polynomial-time computation yields V^π and Q^π_a for all a ∈ A.

Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs
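The closed-form evaluation V^π = (I − γ T_π)^{−1} R_π amounts to a single linear solve. A minimal numerical sketch follows; the transition matrix T_pi and reward vector R_pi below are illustrative placeholders, not the RED/BLUE MDP from the slides.

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state policy: T_pi[s, s'] is the transition matrix
# under pi, R_pi[s] the expected one-step reward under pi.
T_pi = np.array([[0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0],
                 [0.5, 0.5, 0.0]])
R_pi = np.array([1.0, 2.0, -1.0])

# Solve (I - gamma * T_pi) V = R_pi instead of forming the inverse.
V = np.linalg.solve(np.eye(3) - gamma * T_pi, R_pi)
```

The resulting V satisfies the Bellman equation V = R_π + γ T_π V, which is how the value-function table on this slide would be produced, one policy at a time.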
Policy Improvement

Given π,
Pick one or more improvable states, and in them,
Switch to an arbitrary improving action.
Let the resulting policy be π′.

[Figure: an eight-state example s1, . . . , s8; the improvable states of π are marked along with their improving actions, and policy improvement maps π to π′.]

Policy Improvement Theorem (H60, B12):
(1) If π has no improvable states, then it is optimal; else
(2) if π′ is obtained as above, then
∀s ∈ S : V^{π′}(s) ≥ V^π(s) and ∃s ∈ S : V^{π′}(s) > V^π(s).

Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs
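One improvement step can be sketched given the action values Q^π. The theorem allows an arbitrary improving action; the sketch below switches to a greedy one, which is one valid choice. The Q array is hypothetical, made up for illustration.

```python
import numpy as np

def policy_improvement(policy, Q, tol=1e-9):
    """One improvement step: in every improvable state (some action beats
    the current one by more than tol), switch to a greedy improving action."""
    n = len(policy)
    current = Q[np.arange(n), policy]
    improvable = Q.max(axis=1) > current + tol
    new_policy = policy.copy()
    new_policy[improvable] = Q.argmax(axis=1)[improvable]
    return new_policy, bool(improvable.any())

# Hypothetical Q^pi for 3 states x 2 actions; only state 2 is improvable.
Q = np.array([[5.0, 4.0],
              [3.0, 3.0],
              [1.0, 2.0]])
pi = np.array([0, 0, 0])
pi2, changed = policy_improvement(pi, Q)
```

By the Policy Improvement Theorem, the returned policy weakly dominates the input and strictly improves in at least one state whenever `changed` is true.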
Policy Iteration (PI)

π ← Arbitrary policy.
While π has improvable states:
    π ← PolicyImprovement(π).

Different switching strategies lead to different routes to the top.

How long are the routes?!

Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs
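The loop above, instantiated with Howard's switching rule (switch every improvable state to a greedy action), can be sketched end to end. The MDP below (T[s, a, s'] and R[s, a]) is hypothetical, not one from the slides.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP: T[s, a, s'] and R[s, a].
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
n, k = R.shape

pi = np.zeros(n, dtype=int)
while True:
    # Policy evaluation: V^pi = (I - gamma * T_pi)^{-1} R_pi.
    T_pi, R_pi = T[np.arange(n), pi], R[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    # Q^pi_a(s) = R(s, a) + gamma * sum_s' T(s, a, s') V(s').
    Q = R + gamma * T @ V
    improvable = Q.max(axis=1) > Q[np.arange(n), pi] + 1e-9
    if not improvable.any():
        break                      # no improvable states: pi is optimal
    pi[improvable] = Q.argmax(axis=1)[improvable]
```

Each pass through the loop is one iteration in the sense of the bounds discussed next; the question is how many such iterations can occur in the worst case.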
Switching Strategies and Bounds

Upper bounds on number of iterations

PI Variant                        Type           k = 2       General k
Howard’s PI [H60, MS99]           Deterministic  O(2^n / n)  O(k^n / n)
Mansour and Singh’s
Randomised PI [MS99]              Randomised     1.7172^n    ≈ O((k/2)^n)
Batch-switching PI [KMG16a, GK17] Deterministic  1.6479^n    k^{0.7207n}
Recursive Simple PI [KMG16b]      Randomised     –           (2 + ln(k − 1))^n

Lower bounds on number of iterations

Ω(n)          Howard’s PI on n-state, 2-action MDPs [HZ10].
Ω(1.4142^n)   Simple PI on n-state, 2-action MDPs [MC94].

Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs
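For intuition on the k = 2 column, the bounds can be compared numerically at a moderate n (constant factors in the O(·) bounds are omitted here, so the numbers only illustrate how the bases separate as n grows):

```python
n = 30
howard = 2**n / n            # O(2^n / n) upper bound, constants dropped
mansour_singh = 1.7172**n    # randomised PI upper bound
batch_switching = 1.6479**n  # batch-switching PI upper bound
simple_pi_lower = 1.4142**n  # lower bound for Simple PI
```

Even a small reduction in the base of the exponent changes the count by an exponential factor, which is why these improvements are substantial despite all bounds remaining exponential.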
Recursive Simple Policy Iteration

Given π,
Pick the improvable state with the highest index, and
Switch to an improving action picked uniformly at random.
Let the resulting policy be π′.

[Figure: the eight-state example s1, . . . , s8; only the highest-indexed improvable state is switched, mapping π to π′.]

Expected number of iterations: (1 + H_{k−1})^n ≤ (2 + ln(k − 1))^n.

Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs
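The switching rule above can be sketched as a single step: among the improvable states, take the one with the highest index and randomise uniformly over its improving actions. The Q array is hypothetical, made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_pi_step(policy, Q, tol=1e-9):
    """Switch only the highest-indexed improvable state, to an improving
    action chosen uniformly at random."""
    n = len(policy)
    current = Q[np.arange(n), policy][:, None]
    improving = Q > current + tol              # improving[s, a]
    states = np.flatnonzero(improving.any(axis=1))
    if states.size == 0:
        return policy, False                   # no improvable states: optimal
    s = states[-1]                             # highest-indexed improvable state
    a = rng.choice(np.flatnonzero(improving[s]))
    new_policy = policy.copy()
    new_policy[s] = a
    return new_policy, True

# Hypothetical Q^pi: states 0 and 2 are improvable; only state 2 is switched.
Q = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 4.0]])
pi, changed = simple_pi_step(np.array([0, 0, 0]), Q)
```

The randomisation over improving actions is what the expected-iteration bound exploits; switching a single state per iteration is what makes the variant "simple".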
Conclusion

Policy Iteration: a widely used algorithm, more than half a century old.
A substantial gap exists between its upper and lower bounds.
We furnish several exponential improvements to the upper bounds.
PI bears similarity to the Simplex algorithm for Linear Programming.
Howard’s PI works much better in practice than the variants for which we have shown improved upper bounds!
Open problem: Is the number of iterations taken by Howard’s PI on n-state, 2-action MDPs upper-bounded by the (n + 2)-nd Fibonacci number?

For references, see the tutorial:
Theoretical Analysis of Policy Iteration. Tutorial at IJCAI 2017.
https://www.cse.iitb.ac.in/~shivaram/resources/ijcai-2017-tutorial-policyiteration/index.html

Thank you!

Shivaram Kalyanakrishnan (2017) Analysis of Policy Iteration in MDPs