
EWRL - Barcelona, December 2016

Exploration-Exploitation in MDPs with Options

Ronan Fruit, Alessandro Lazaric

SequeL – INRIA Lille

Context and Motivations

Online learning with options: Overview

Assumption: options are assumed to be given as input.

1. Empirical observations: introducing options in an MDP can speed up learning [5], but it can also be harmful [3].

2. Question: what theoretical properties can explain these observations?

3. Existing algorithms based on options:
   - Model-free learning: SMDP-Q-learning [4]
   - Model-based learning: SMDP-RMAX, SMDP-E3 [1]

4. In this presentation: model-based learning in the regret setting.

Outline

- Notations and Theoretical Background
- Algorithm
- Analysis: SMDPs, MDPs with Options
- Results
- Conclusion and Future Work

Notations and Theoretical Background

Markov Decision Processes: Notations

M = {S, A, p, r} where:
- S is a set of states,
- A = (A_s)_{s∈S} is a set of actions,
- p(s′ | s, a) is the probability of the transition (s, a) → s′,
- r(s, a) is the reward of action a in state s.

Notations and Theoretical Background

Temporal Abstraction: The Option Framework

Definition 1
A (Markov) option o is a 3-tuple {I_o, β_o, π_o} where:
- I_o ⊂ S is the set of initial states,
- β_o : S → [0, 1] is a Markov termination condition,
- π_o ∈ Π^SR is a stationary Markov policy.
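To make the definitions concrete, here is a minimal sketch of how the MDP tuple M = {S, A, p, r} and a Markov option {I_o, β_o, π_o} could be represented in code. The container names and fields are illustrative choices, not notation from the slides; the option policy is kept deterministic for simplicity even though Definition 1 allows randomized policies.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class TabularMDP:
    """M = {S, A, p, r} with states 0..n_states-1 and actions 0..n_actions-1."""
    n_states: int
    n_actions: int
    p: np.ndarray   # p[s, a, s'] = probability of the transition (s, a) -> s'
    r: np.ndarray   # r[s, a]     = reward of action a in state s


@dataclass
class MarkovOption:
    """Option o = {I_o, beta_o, pi_o} from Definition 1."""
    init_states: List[int]         # I_o, subset of S
    beta: Callable[[int], float]   # beta_o(s): probability of terminating in state s
    policy: Callable[[int], int]   # pi_o(s): action taken in s while the option runs
```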

Notations and Theoretical Background

Temporal Abstraction: Semi-Markov Decision Processes

Definition 2
An SMDP M′ = {S, A, p, r, τ} is an MDP with a random holding time τ(s, a) associated with each state-action pair.

Proposition 1 ([4])
A set of options O defined on an MDP M induces an SMDP M′: M + O ⇒ M′.

Theorem 3
If the set of options is proper, all holding times and rewards are sub-Exponential. Moreover, they are sub-Gaussian if and only if they are bounded.

⇒ MDP + options ⇒ sub-Exponential or bounded SMDP
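Proposition 1 can be read operationally: running one option from its initiation state until its termination condition fires produces exactly one SMDP transition, with a cumulative reward and a random holding time. The sketch below simulates such a transition from plain arrays and callables; it is an illustration of the induced-SMDP view under the same hypothetical tabular representation as above, not code from the paper.

```python
import numpy as np


def execute_option(p, r, beta, policy, s0, rng, max_steps=100_000):
    """Simulate one SMDP transition induced by an option (Proposition 1).

    p[s, a, s'] and r[s, a] describe the underlying MDP, beta(s) is the
    option's termination probability in s, and policy(s) its action choice.
    Returns (next_state, cumulative_reward, holding_time) = (s', r, tau).
    """
    s, cum_reward, tau = s0, 0.0, 0
    for _ in range(max_steps):
        a = policy(s)
        cum_reward += r[s, a]
        s = rng.choice(p.shape[2], p=p[s, a])
        tau += 1
        if rng.random() < beta(s):   # Markov termination condition beta_o
            break
    return s, cum_reward, tau
```

For a proper set of options this loop terminates with probability 1, which is the property Theorem 3 exploits to control the tails of τ and r.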

Notations and Theoretical Background

SMDP Learning protocol

Starting from s_0 with n = 0, T_0 = 0 and R_0 = 0: at each decision step n, the agent chooses one of the available actions in the current state, observes a holding time τ_n and a cumulative reward r_n, and moves to the next state s_n. The counters are updated as T_n ← T_{n−1} + τ_n and R_n ← R_{n−1} + r_n, so that

⇒ T_n = Σ_{i=1}^n τ_i and R_n = Σ_{i=1}^n r_i

Notations and Theoretical Background

Learning Setting

1. T_n: number of time steps; n: number of decision steps ⇒ T_n/n: "scale factor".

2. An optimal policy π* for the agent is a policy maximizing the long-term average reward:

   ρ*(M) = max_π { lim_{n→+∞} E^π[ R_n / T_n ] }

3. The agent aims at minimizing the total regret:

   Δ(M, A, n) = ρ*(M) T_n − R_n
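The learning protocol and the regret definition translate directly into bookkeeping. In the sketch below, `env.step` is a hypothetical interface returning the SMDP-level feedback (next state, cumulative reward r_i, holding time τ_i) and `agent` is any learner; only the T_n, R_n and regret updates mirror the slides.

```python
def run_smdp_protocol(env, agent, n, rho_star):
    """Run n decision steps and return (T_n, R_n, total regret).

    T_n = sum of holding times, R_n = sum of rewards, and the regret is
    Delta(M, A, n) = rho_star * T_n - R_n, as in the learning setting above.
    """
    s = env.reset()
    T_n, R_n = 0.0, 0.0
    for _ in range(n):
        a = agent.act(s)
        s, r_i, tau_i = env.step(a)        # one SMDP transition
        T_n += tau_i
        R_n += r_i
        agent.observe(a, s, r_i, tau_i)
    return T_n, R_n, rho_star * T_n - R_n
```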


Algorithm

UCRL-based algorithm:

1. Start a new episode k ← k + 1.

2. Use the observations acquired during previous episodes to build an estimated SMDP with confidence intervals ⇒ bounded-parameter SMDP:

   |r̃(s, a) − r̂_k(s, a)| ≤ β^r_k(s, a)   and   0 ≤ r̃(s, a) ≤ τ_max,
   |τ̃(s, a) − τ̂_k(s, a)| ≤ β^τ_k(s, a)   and   τ_min ≤ τ̃(s, a) ≤ τ_max,   τ̃(s, a) ≥ r̃(s, a),
   ‖p̃(· | s, a) − p̂_k(· | s, a)‖_1 ≤ β^p_k(s, a)   and   Σ_{s′∈S} p̃(s′ | s, a) = 1,

   where X̂_k denotes the empirical estimate of X at episode k, β_k the corresponding confidence interval, and τ_max = max_{s,a} τ̄(s, a) and τ_min = min_{s,a} τ̄(s, a) are the maximal and minimal expected durations of the options.

3. Using value iteration, compute an (approximately) optimal policy π̃_k for the optimistic SMDP M̃_k ⇒ optimism in the face of uncertainty:

   M̃_k = argmax_{M′} { ρ*(M′) },   π̃_k = argmax_π { ρ^π(M̃_k) }.

4. Follow the policy π̃_k until the number of new observations has doubled for at least one pair (s, a).

5. End episode k.

⇒ Repeat steps 1 to 5 until the number of decision steps reaches n (a code sketch of this loop follows below).
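A compact sketch of the episode structure described in steps 1 to 5. The confidence radii below are generic Hoeffding/Weissman-style placeholders (the exact β_k terms are specified in the analysis), and the optimistic planner, i.e. the value-iteration step over the bounded-parameter SMDP, is passed in as a callable rather than implemented here; the environment interface is the same hypothetical one as in the earlier sketches.

```python
import numpy as np


def smdp_ucrl(env, planner, n_states, n_actions, n, delta, tau_max, r_max):
    """Episode loop of the UCRL-based algorithm (steps 1-5), as a sketch."""
    N = np.zeros((n_states, n_actions))                 # total visit counts
    sum_r = np.zeros((n_states, n_actions))             # cumulative rewards
    sum_tau = np.zeros((n_states, n_actions))           # cumulative holding times
    trans = np.zeros((n_states, n_actions, n_states))   # transition counts

    s, i = env.reset(), 0
    while i < n:                                        # step 1: new episode
        # Step 2: empirical SMDP + confidence intervals (bounded-parameter SMDP).
        N_plus = np.maximum(N, 1)
        r_hat, tau_hat = sum_r / N_plus, sum_tau / N_plus
        p_hat = trans / N_plus[..., None]
        log_term = np.log(max(n, 2) / delta)
        beta_r = r_max * np.sqrt(log_term / N_plus)       # placeholder radius
        beta_tau = tau_max * np.sqrt(log_term / N_plus)   # placeholder radius
        beta_p = np.sqrt(n_states * log_term / N_plus)    # placeholder radius

        # Step 3: optimistic policy for the optimistic SMDP (value iteration
        # over the bounded-parameter SMDP, delegated to the caller).
        policy = planner(r_hat, tau_hat, p_hat, beta_r, beta_tau, beta_p)

        # Step 4: follow the policy until some (s, a) doubles its sample count.
        nu = np.zeros_like(N)
        while i < n and np.all(nu < N_plus):
            a = policy[s]
            s_next, r_i, tau_i = env.step(a)              # one SMDP transition
            nu[s, a] += 1
            sum_r[s, a] += r_i
            sum_tau[s, a] += tau_i
            trans[s, a, s_next] += 1
            s, i = s_next, i + 1

        N += nu                                           # step 5: end episode
```

Passing the planner as a callable keeps the sketch independent of the specific value-iteration routine used for the optimistic SMDP.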


Analysis: SMDPs

Regret analysis for SMDPs

Theorem 4
With high probability, the regret of SMDP-UCRL in M′ is bounded as

   Δ(M′, A, n) = O( (D′ √S′ + C(M′, n, δ)) √(S′ A′ n log(n/δ)) )

with:
- δ: confidence level
- S′: number of states
- A′ = max_s |A′_s|: maximal number of actions per state
- n: number of decision steps
- D′: diameter of the SMDP,

   D(M′) = max_{s,s′∈S} min_d { E^d[ T(s′) | s_0 = s ] },   where   T(s′) = inf{ Σ_{i=1}^n τ_i : n ∈ N, s_n = s′ }

  and the minimum is over stationary deterministic policies d
- C(M′, n, δ) = O(T_max) for bounded r and τ, where T_max is the maximal holding time
- C(M′, n, δ) = τ_max + C_τ √(log(n/δ)) for sub-Exponential r and τ, where C_τ is the sub-Exponential constant

Question: can we use this bound to explain when options can speed up learning?
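For a quick sense of scale, the dominant term of the Theorem 4 bound can be evaluated by plugging numbers into (D′√S′ + C(M′, n, δ))·√(S′A′ n log(n/δ)). The helper below does exactly that, ignoring the hidden constant; the example values are arbitrary.

```python
import math


def theorem4_leading_term(D_prime, S_prime, A_prime, n, delta, C):
    """(D' * sqrt(S') + C) * sqrt(S' * A' * n * log(n / delta)), constants omitted."""
    return (D_prime * math.sqrt(S_prime) + C) * math.sqrt(
        S_prime * A_prime * n * math.log(n / delta))


print(theorem4_leading_term(D_prime=20, S_prime=400, A_prime=4, n=10**6, delta=0.05, C=5.0))
```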

Analysis: MDPs with Options

Learning in MDPs with Options

- Π_M: set of policies in the MDP M, with optimal gain ρ*(M)
- Π^O_M: set of policies over options in M, with optimal gain ρ*_O(M)

Since policies over options form a subset of Π_M, we have ρ*(M) ≥ ρ*_O(M).

Analysis: MDPs with Options

Learning in MDPs with Options: Remark 1

- Regret of SMDP-UCRL in the original MDP M:

   Δ(M, A, T_n) = Δ(M′, A, n) + T_n (ρ*(M) − ρ*_O(M))
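This decomposition follows by adding and subtracting ρ*_O(M)·T_n in the definition of the regret, together with the fact that ρ*_O(M) is the optimal gain ρ*(M′) of the induced SMDP; a short expansion, in the notation of the slides, for reference:

```latex
\Delta(M, A, T_n)
  = \rho^*(M)\, T_n - R_n
  = \underbrace{T_n \big(\rho^*(M) - \rho^*_{\mathcal{O}}(M)\big)}_{\text{loss due to restricting to options}}
  \; + \; \underbrace{\rho^*_{\mathcal{O}}(M)\, T_n - R_n}_{=\ \rho^*(M')\, T_n - R_n \ =\ \Delta(M', A, n)}
```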

Question: how does the regret compare with and without options?

- Ratio of the regrets of SMDP-UCRL and MDP-UCRL:

   R = Δ(M, SMDP-UCRL, T_n) / Δ(M, MDP-UCRL, T_n)   (is R ≤ 1?)

- Regret bound for MDPs ([2]):

   Δ(M, A, T_n) = O( D S √(A T_n log(T_n/δ)) )

- Asymptotic ratio of the bounds:

   R̃ ∼ ( D′/D + T_max/(D √S) ) √( O n / (A T_n) ) ∼ (D′/D) √(O/A) √(n/T_n)
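The asymptotic ratio is straightforward to evaluate numerically. The helper below implements R̃ ≈ (D′/D + T_max/(D√S))·√(O n/(A T_n)) exactly as written above; the example values are loosely inspired by the grid experiments that follow, and both D′ and the ratio n/T_n are assumptions chosen purely for illustration.

```python
import math


def regret_ratio(D, D_prime, S, A, O, T_max, n_over_Tn):
    """Asymptotic ratio R~ of the SMDP-UCRL and MDP-UCRL regret bounds."""
    return (D_prime / D + T_max / (D * math.sqrt(S))) * math.sqrt(O * n_over_Tn / A)


# Options are only provably helpful (according to this bound) when the ratio is below 1.
print(regret_ratio(D=38, D_prime=50, S=400, A=4, O=4, T_max=4, n_over_Tn=1 / 2.5))
```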


Results

Illustrative Experiments: Options with bounded holding time and reward

Grid-world with bounded options:
- A = O = 4
- S′ = S = d²
- D = 2(d − 1)
- τ_max? D′?

   R̃ ∼ (D′/D) √(O/A) √(n/T_n)

Results

Illustrative Experiments: Options with bounded holding time and reward

- Stochastic termination condition with probability 1/T_max.

  [Figure: chain of states s_0, s_1, s_2, s_3 reached from the starting state, each transition annotated with termination probability 1/T_max.]

   τ_max = (T_max + 1)/2   and   D′ ≤ D + T_max(T_max + 1)

   R̃ ∼ (D′/D) √(O/A) √(n/T_n)
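Putting the experiment's quantities together: for a d×d grid the slides give D = 2(d−1), τ_max = (T_max+1)/2 and the bound D′ ≤ D + T_max(T_max+1). The snippet below plugs these into the simplified ratio (D′/D)·√(O/A)·√(n/T_n), additionally assuming T_n/n ≈ τ_max, i.e. that every option runs for its maximal expected duration. That assumption and the use of the upper bound on D′ make the values only indicative; they need not match the figures reported next.

```python
import math


def grid_ratio_bound(d, T_max, O=4, A=4):
    """Indicative value of (D'/D) * sqrt(O/A) * sqrt(n/T_n) for a d x d grid."""
    D = 2 * (d - 1)                     # diameter of the grid MDP (slide value)
    D_prime = D + T_max * (T_max + 1)   # slide's upper bound on the SMDP diameter
    tau_max = (T_max + 1) / 2           # maximal expected option duration
    return (D_prime / D) * math.sqrt(O / A) * math.sqrt(1 / tau_max)  # n/T_n ~ 1/tau_max


for d in (20, 25, 30):
    print(d, [round(grid_ratio_bound(d, T_max), 2) for T_max in range(1, 9)])
```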

Results

Asymptotic bound on the ratio

[Figure: asymptotic bound on the ratio of regrets R̃ as a function of the maximal duration of options T_max, for 20x20, 25x25 and 30x30 grids.]

Results

Simulations

[Figure: empirical ratio of regrets R as a function of the maximal duration of options T_max, for 20x20, 25x25 and 30x30 grids.]


Conclusion and Future Work

Conclusion:
1. UCRL-based algorithm for SMDPs
2. Regret bound (SMDPs, MDPs with options)
3. Theoretical/empirical comparison

Current limitations of SMDP-UCRL for options:
1. Additional knowledge about the options is needed (e.g., sub-Exponential constants)
2. Similarities between options and the correlation between rewards and holding times are not exploited

Extensions:
1. Automatic discovery of options? [estimating the diameter]

Thank you for your attention

References

[1] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 316-324. JMLR Workshop and Conference Proceedings, 2014.

[2] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.

[3] Nicholas K. Jong, Todd Hester, and Peter Stone. The utility of temporal abstraction in reinforcement learning. In Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), 2008.

[4] Richard Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.

[5] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. CoRR, abs/1604.07255, 2016.