
EWRL - Barcelona, December 2016

Exploration-Exploitation in MDPs with Options

Ronan Fruit, Alessandro Lazaric

SequeL – INRIA Lille

Context and Motivations

Online learning with options: Overview

Assumption: options are assumed to be given as input.

1. Empirical observations: introducing options in an MDP can speed up learning [5], but it can also be harmful [3].

2. Question: what theoretical properties can explain these observations?

3. Existing algorithms based on options:
   - Model-free learning: SMDP-Q-learning [4]
   - Model-based learning: SMDP-RMAX, SMDP-E3 [1]

4. In this presentation: model-based learning in the regret setting.

Outline

- Notations and Theoretical Background
- Algorithm
- Analysis: SMDPs, MDPs with Options
- Results
- Conclusion and Future Work

Notations and Theoretical Background

Markov Decision Processes: Notations

M = {S, A, p, r} where:
- S is a set of states,
- A = (A_s)_{s∈S} is a set of actions,
- p(s′ | s, a) is the probability of the transition (s, a) → s′,
- r(s, a) is the reward of action a in state s.

Notations and Theoretical Background

Temporal Abstraction: The Option Framework

Definition 1
A (Markov) option o is a 3-tuple {I_o, β_o, π_o} where:
- I_o ⊂ S is the set of initial states,
- β_o : S → [0, 1] is a Markov termination condition,
- π_o ∈ Π^SR is a stationary Markov policy.
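To make the definitions concrete, here is a minimal sketch of how the MDP tuple M = {S, A, p, r} and a Markov option {I_o, β_o, π_o} could be represented in code. The container names and fields are illustrative choices, not notation from the slides; the option policy is kept deterministic for simplicity even though Definition 1 allows randomized policies.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class TabularMDP:
    """M = {S, A, p, r} with states 0..n_states-1 and actions 0..n_actions-1."""
    n_states: int
    n_actions: int
    p: np.ndarray   # p[s, a, s'] = probability of the transition (s, a) -> s'
    r: np.ndarray   # r[s, a]     = reward of action a in state s


@dataclass
class MarkovOption:
    """Option o = {I_o, beta_o, pi_o} from Definition 1."""
    init_states: List[int]         # I_o, subset of S
    beta: Callable[[int], float]   # beta_o(s): probability of terminating in state s
    policy: Callable[[int], int]   # pi_o(s): action taken in s while the option runs
```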

Notations and Theoretical Background

Temporal Abstraction: Semi-Markov Decision Processes

Definition 2
An SMDP M′ = {S, A, p, r, τ} is an MDP with a random holding time τ(s, a) associated with each state-action pair.

Proposition 1 ([4])
A set of options O defined on an MDP M induces an SMDP M′: M + O ⇒ M′.

Theorem 3
If the set of options is proper, all holding times and rewards are sub-Exponential. Moreover, they are sub-Gaussian if and only if they are bounded.

⇒ MDP + options ⇒ sub-Exponential or bounded SMDP
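Proposition 1 can be read operationally: running one option from its initiation state until its termination condition fires produces exactly one SMDP transition, with a cumulative reward and a random holding time. The sketch below simulates such a transition from plain arrays and callables; it is an illustration of the induced-SMDP view under the same hypothetical tabular representation as above, not code from the paper.

```python
import numpy as np


def execute_option(p, r, beta, policy, s0, rng, max_steps=100_000):
    """Simulate one SMDP transition induced by an option (Proposition 1).

    p[s, a, s'] and r[s, a] describe the underlying MDP, beta(s) is the
    option's termination probability in s, and policy(s) its action choice.
    Returns (next_state, cumulative_reward, holding_time) = (s', r, tau).
    """
    s, cum_reward, tau = s0, 0.0, 0
    for _ in range(max_steps):
        a = policy(s)
        cum_reward += r[s, a]
        s = rng.choice(p.shape[2], p=p[s, a])
        tau += 1
        if rng.random() < beta(s):   # Markov termination condition beta_o
            break
    return s, cum_reward, tau
```

For a proper set of options this loop terminates with probability 1, which is the property Theorem 3 exploits to control the tails of τ and r.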

Notations and Theoretical Background

SMDP Learning protocol

Starting from s_0 with n = 0, T_0 = 0 and R_0 = 0: at each decision step n, the agent chooses one of the available actions in the current state, observes a holding time τ_n and a cumulative reward r_n, and moves to the next state s_n. The counters are updated as T_n ← T_{n−1} + τ_n and R_n ← R_{n−1} + r_n, so that

⇒ T_n = Σ_{i=1}^n τ_i and R_n = Σ_{i=1}^n r_i

Notations and Theoretical Background

Learning Setting

1. T_n: number of time steps; n: number of decision steps ⇒ T_n/n: "scale factor".

2. An optimal policy π* for the agent is a policy maximizing the long-term average reward:

   ρ*(M) = max_π { lim_{n→+∞} E^π[ R_n / T_n ] }

3. The agent aims at minimizing the total regret:

   Δ(M, A, n) = ρ*(M) T_n − R_n
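The learning protocol and the regret definition translate directly into bookkeeping. In the sketch below, `env.step` is a hypothetical interface returning the SMDP-level feedback (next state, cumulative reward r_i, holding time τ_i) and `agent` is any learner; only the T_n, R_n and regret updates mirror the slides.

```python
def run_smdp_protocol(env, agent, n, rho_star):
    """Run n decision steps and return (T_n, R_n, total regret).

    T_n = sum of holding times, R_n = sum of rewards, and the regret is
    Delta(M, A, n) = rho_star * T_n - R_n, as in the learning setting above.
    """
    s = env.reset()
    T_n, R_n = 0.0, 0.0
    for _ in range(n):
        a = agent.act(s)
        s, r_i, tau_i = env.step(a)        # one SMDP transition
        T_n += tau_i
        R_n += r_i
        agent.observe(a, s, r_i, tau_i)
    return T_n, R_n, rho_star * T_n - R_n
```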


Algorithm

UCRL-based algorithm:

1. Start a new episode k ← k + 1.

2. Use the observations acquired during previous episodes to build an estimated SMDP with confidence intervals ⇒ bounded-parameter SMDP:

   |r̃(s, a) − r̂_k(s, a)| ≤ β^r_k(s, a)   and   0 ≤ r̃(s, a) ≤ τ_max,
   |τ̃(s, a) − τ̂_k(s, a)| ≤ β^τ_k(s, a)   and   τ_min ≤ τ̃(s, a) ≤ τ_max,   τ̃(s, a) ≥ r̃(s, a),
   ‖p̃(· | s, a) − p̂_k(· | s, a)‖_1 ≤ β^p_k(s, a)   and   Σ_{s′∈S} p̃(s′ | s, a) = 1,

   where X̂_k denotes the empirical estimate of X at episode k, β_k the corresponding confidence interval, and τ_max = max_{s,a} τ̄(s, a) and τ_min = min_{s,a} τ̄(s, a) are the maximal and minimal expected durations of the options.

3. Using value iteration, compute an (approximately) optimal policy π̃_k for the optimistic SMDP M̃_k ⇒ optimism in the face of uncertainty:

   M̃_k = argmax_{M′} { ρ*(M′) },   π̃_k = argmax_π { ρ^π(M̃_k) }.

4. Follow the policy π̃_k until the number of new observations has doubled for at least one pair (s, a).

5. End episode k.

⇒ Repeat steps 1 to 5 until the number of decision steps reaches n (a code sketch of this loop follows below).
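A compact sketch of the episode structure described in steps 1 to 5. The confidence radii below are generic Hoeffding/Weissman-style placeholders (the exact β_k terms are specified in the analysis), and the optimistic planner, i.e. the value-iteration step over the bounded-parameter SMDP, is passed in as a callable rather than implemented here; the environment interface is the same hypothetical one as in the earlier sketches.

```python
import numpy as np


def smdp_ucrl(env, planner, n_states, n_actions, n, delta, tau_max, r_max):
    """Episode loop of the UCRL-based algorithm (steps 1-5), as a sketch."""
    N = np.zeros((n_states, n_actions))                 # total visit counts
    sum_r = np.zeros((n_states, n_actions))             # cumulative rewards
    sum_tau = np.zeros((n_states, n_actions))           # cumulative holding times
    trans = np.zeros((n_states, n_actions, n_states))   # transition counts

    s, i = env.reset(), 0
    while i < n:                                        # step 1: new episode
        # Step 2: empirical SMDP + confidence intervals (bounded-parameter SMDP).
        N_plus = np.maximum(N, 1)
        r_hat, tau_hat = sum_r / N_plus, sum_tau / N_plus
        p_hat = trans / N_plus[..., None]
        log_term = np.log(max(n, 2) / delta)
        beta_r = r_max * np.sqrt(log_term / N_plus)       # placeholder radius
        beta_tau = tau_max * np.sqrt(log_term / N_plus)   # placeholder radius
        beta_p = np.sqrt(n_states * log_term / N_plus)    # placeholder radius

        # Step 3: optimistic policy for the optimistic SMDP (value iteration
        # over the bounded-parameter SMDP, delegated to the caller).
        policy = planner(r_hat, tau_hat, p_hat, beta_r, beta_tau, beta_p)

        # Step 4: follow the policy until some (s, a) doubles its sample count.
        nu = np.zeros_like(N)
        while i < n and np.all(nu < N_plus):
            a = policy[s]
            s_next, r_i, tau_i = env.step(a)              # one SMDP transition
            nu[s, a] += 1
            sum_r[s, a] += r_i
            sum_tau[s, a] += tau_i
            trans[s, a, s_next] += 1
            s, i = s_next, i + 1

        N += nu                                           # step 5: end episode
```

Passing the planner as a callable keeps the sketch independent of the specific value-iteration routine used for the optimistic SMDP.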


Analysis: SMDPs

Regret analysis for SMDPs

Theorem 4
With high probability, the regret of SMDP-UCRL in M′ is bounded as

   Δ(M′, A, n) = O( (D′ √S′ + C(M′, n, δ)) √(S′ A′ n log(n/δ)) )

with:
- δ: confidence level
- S′: number of states
- A′ = max_s |A′_s|: maximal number of actions per state
- n: number of decision steps
- D′: diameter of the SMDP,

   D(M′) = max_{s,s′∈S} min_d { E^d[ T(s′) | s_0 = s ] },   where   T(s′) = inf{ Σ_{i=1}^n τ_i : n ∈ N, s_n = s′ }

  and the minimum is over stationary deterministic policies d
- C(M′, n, δ) = O(T_max) for bounded r and τ, where T_max is the maximal holding time
- C(M′, n, δ) = τ_max + C_τ √(log(n/δ)) for sub-Exponential r and τ, where C_τ is the sub-Exponential constant

Question: can we use this bound to explain when options can speed up learning?
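For a quick sense of scale, the dominant term of the Theorem 4 bound can be evaluated by plugging numbers into (D′√S′ + C(M′, n, δ))·√(S′A′ n log(n/δ)). The helper below does exactly that, ignoring the hidden constant; the example values are arbitrary.

```python
import math


def theorem4_leading_term(D_prime, S_prime, A_prime, n, delta, C):
    """(D' * sqrt(S') + C) * sqrt(S' * A' * n * log(n / delta)), constants omitted."""
    return (D_prime * math.sqrt(S_prime) + C) * math.sqrt(
        S_prime * A_prime * n * math.log(n / delta))


print(theorem4_leading_term(D_prime=20, S_prime=400, A_prime=4, n=10**6, delta=0.05, C=5.0))
```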

Analysis: MDPs with Options

Learning in MDPs with Options

- Π_M: set of policies in the MDP M, with optimal gain ρ*(M)
- Π^O_M: set of policies over options in M, with optimal gain ρ*_O(M)

Since policies over options form a subset of Π_M, we have ρ*(M) ≥ ρ*_O(M).

Analysis: MDPs with Options

Learning in MDPs with Options: Remark 1

- Regret of SMDP-UCRL in the original MDP M:

   Δ(M, A, T_n) = Δ(M′, A, n) + T_n (ρ*(M) − ρ*_O(M))
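This decomposition follows by adding and subtracting ρ*_O(M)·T_n in the definition of the regret, together with the fact that ρ*_O(M) is the optimal gain ρ*(M′) of the induced SMDP; a short expansion, in the notation of the slides, for reference:

```latex
\Delta(M, A, T_n)
  = \rho^*(M)\, T_n - R_n
  = \underbrace{T_n \big(\rho^*(M) - \rho^*_{\mathcal{O}}(M)\big)}_{\text{loss due to restricting to options}}
  \; + \; \underbrace{\rho^*_{\mathcal{O}}(M)\, T_n - R_n}_{=\ \rho^*(M')\, T_n - R_n \ =\ \Delta(M', A, n)}
```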

Question: how does the regret compare with and without options?

- Ratio of the regrets of SMDP-UCRL and MDP-UCRL:

   R = Δ(M, SMDP-UCRL, T_n) / Δ(M, MDP-UCRL, T_n)   (is R ≤ 1?)

- Regret bound for MDPs ([2]):

   Δ(M, A, T_n) = O( D S √(A T_n log(T_n/δ)) )

- Asymptotic ratio of the bounds:

   R̃ ∼ ( D′/D + T_max/(D √S) ) √( O n / (A T_n) ) ∼ (D′/D) √(O/A) √(n/T_n)
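The asymptotic ratio is straightforward to evaluate numerically. The helper below implements R̃ ≈ (D′/D + T_max/(D√S))·√(O n/(A T_n)) exactly as written above; the example values are loosely inspired by the grid experiments that follow, and both D′ and the ratio n/T_n are assumptions chosen purely for illustration.

```python
import math


def regret_ratio(D, D_prime, S, A, O, T_max, n_over_Tn):
    """Asymptotic ratio R~ of the SMDP-UCRL and MDP-UCRL regret bounds."""
    return (D_prime / D + T_max / (D * math.sqrt(S))) * math.sqrt(O * n_over_Tn / A)


# Options are only provably helpful (according to this bound) when the ratio is below 1.
print(regret_ratio(D=38, D_prime=50, S=400, A=4, O=4, T_max=4, n_over_Tn=1 / 2.5))
```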


Results

Illustrative Experiments: Options with bounded holding time and reward

Grid-world with bounded options:
- A = O = 4
- S′ = S = d²
- D = 2(d − 1)
- τ_max? D′?

   R̃ ∼ (D′/D) √(O/A) √(n/T_n)

Results

Illustrative Experiments: Options with bounded holding time and reward

- Stochastic termination condition with probability 1/T_max.

  [Figure: chain of states s_0, s_1, s_2, s_3 reached from the starting state, each transition annotated with termination probability 1/T_max.]

   τ_max = (T_max + 1)/2   and   D′ ≤ D + T_max(T_max + 1)

   R̃ ∼ (D′/D) √(O/A) √(n/T_n)
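Putting the experiment's quantities together: for a d×d grid the slides give D = 2(d−1), τ_max = (T_max+1)/2 and the bound D′ ≤ D + T_max(T_max+1). The snippet below plugs these into the simplified ratio (D′/D)·√(O/A)·√(n/T_n), additionally assuming T_n/n ≈ τ_max, i.e. that every option runs for its maximal expected duration. That assumption and the use of the upper bound on D′ make the values only indicative; they need not match the figures reported next.

```python
import math


def grid_ratio_bound(d, T_max, O=4, A=4):
    """Indicative value of (D'/D) * sqrt(O/A) * sqrt(n/T_n) for a d x d grid."""
    D = 2 * (d - 1)                     # diameter of the grid MDP (slide value)
    D_prime = D + T_max * (T_max + 1)   # slide's upper bound on the SMDP diameter
    tau_max = (T_max + 1) / 2           # maximal expected option duration
    return (D_prime / D) * math.sqrt(O / A) * math.sqrt(1 / tau_max)  # n/T_n ~ 1/tau_max


for d in (20, 25, 30):
    print(d, [round(grid_ratio_bound(d, T_max), 2) for T_max in range(1, 9)])
```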

Results

Asymptotic bound on the ratio

[Figure: asymptotic bound on the ratio of regrets R̃ as a function of the maximal duration of options T_max, for 20x20, 25x25 and 30x30 grids.]

Results

Simulations

[Figure: empirical ratio of regrets R as a function of the maximal duration of options T_max, for 20x20, 25x25 and 30x30 grids.]


Conclusion and Future Work

Conclusion:
1. UCRL-based algorithm for SMDPs
2. Regret bound (SMDPs, MDPs with options)
3. Theoretical/empirical comparison

Current limitations of SMDP-UCRL for options:
1. Additional knowledge about the options is needed (e.g., sub-Exponential constants)
2. Similarities between options and the correlation between rewards and holding times are not exploited

Extensions:
1. Automatic discovery of options? [estimating the diameter]

Thank you for your attention

References

[1] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 316-324. JMLR Workshop and Conference Proceedings, 2014.

[2] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.

[3] Nicholas K. Jong, Todd Hester, and Peter Stone. The utility of temporal abstraction in reinforcement learning. In Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), 2008.

[4] Richard Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.

[5] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. CoRR, abs/1604.07255, 2016.