
Value Preserving State-Action Abstractions

David Abel (Brown University), Nathan Umbanhowar (Brown University), Khimya Khetarpal (Mila-McGill University)

Dilip Arumugam (Stanford University), Doina Precup (Mila-McGill University), Michael L. Littman (Brown University)

Abstract

Abstraction can improve the sample efficiency of reinforcement learning. However, the process of abstraction inherently discards information, potentially compromising an agent's ability to represent high-value policies. To mitigate this, we here introduce combinations of state abstractions and options that are guaranteed to preserve the representation of near-optimal policies. We first define φ-relative options, a general formalism for analyzing the value loss of options paired with a state abstraction, and present necessary and sufficient conditions for φ-relative options to preserve near-optimal behavior in any finite Markov Decision Process. We further show that, under appropriate assumptions, φ-relative options can be composed to induce hierarchical abstractions that are also guaranteed to represent high-value policies.

1 INTRODUCTION

Learning to make high-quality decisions in complex environments is challenging. To mitigate this difficulty, knowledge can be incorporated into learning algorithms through inductive biases, heuristics, or other priors that provide information about the world. In reinforcement learning (RL), one powerful class of such structures takes the form of abstractions, either of state (what are the relevant properties of the world?) or action (what correlated, long-horizon sequences of actions are useful?).

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

To make RL more tractable, abstractions throw away information about the environment such as irrelevant state features or ineffective sequences of actions. If the abstractions are too aggressive, however, they destroy an agent's ability to solve tasks of interest. Thus, there is a trade-off inherent in the role that abstractions play: they should 1) make learning easier, while 2) preserving enough information to support the discovery of good behavioral policies. Indeed, prior work has illustrated the potential for abstractions to support both the first property (Konidaris and Barto, 2007; Brunskill and Li, 2014; Bacon et al., 2017) and the second (Li et al., 2006; Van Roy, 2006; Hutter, 2014).

To realize the benefits of state and action abstraction, it is common to make use of both. One approach builds around MDP homomorphisms, introduced by Ravindran and Barto (2002), based on the earlier work of Givan et al. (1997) and Whitt (1978). Certain classes of homomorphisms can lead to dramatic reductions in the size of the model needed to describe the environment while preserving representation of high-value policies. Ravindran and Barto (2003a, 2004) extend these ideas to semi-Markovian environments based on the options framework (Sutton et al., 1999), illustrating how approximate model reduction techniques can be blended with hierarchical structures to form good compact representations. Majeed and Hutter (2019) carry out similar analysis in non-Markovian environments, again proving which conditions preserve value. However, we lack a general theory of value-preserving state-action abstractions, especially when abstract actions are extended over multiple time steps. More concretely, the following question remains open:

Which combinations of state abstractions and options preserve representation of near-optimal policies?

The primary contribution of this work is new analysis clarifying which combinations of state abstractions (φ) and options (O) preserve representation of near-optimal policies in finite Markovian environments.


To perform this analysis, we first define φ-relative options, a general formalism for analyzing the value loss of a state abstraction paired with a set of options. We then prove four sufficient conditions, along with one necessary condition, for φ-relative options to preserve near-optimal behavior in any finite Markov Decision Process (MDP) (Bellman, 1957; Puterman, 2014). We further prove that φ-relative options can be composed to induce a hierarchy that preserves near-optimal behavior under appropriate assumptions about the hierarchy's construction. We suggest these results can support the development of principled methods that learn and make use of value-preserving abstractions.

1.1 Background

We next provide background on state and action abstraction. We take the standard treatment of RL: an agent learns to make decisions that maximize value while interacting with an MDP denoted by the five-tuple (S, A, R, T, γ). For more on RL, see the book by Sutton and Barto (2018); for more on MDPs, see the book by Puterman (2014).
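As a concrete reference point for this five-tuple, the following is a minimal sketch of a finite MDP stored as dense arrays, together with a value-iteration helper reused by later sketches. The class name and array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class TabularMDP:
    """A finite MDP (S, A, R, T, gamma) stored as dense arrays.

    T has shape (S, A, S') with T[s, a, s2] the probability of reaching s2
    from s under action a; R has shape (S, A) and holds expected immediate
    rewards. Illustrative sketch only.
    """

    def __init__(self, T, R, gamma):
        assert T.shape[0] == T.shape[2] == R.shape[0]
        assert T.shape[1] == R.shape[1]
        self.T, self.R, self.gamma = T, R, gamma
        self.num_states, self.num_actions = R.shape

    def value_iteration(self, tol=1e-8):
        """Compute V* by iterating the Bellman optimality backup."""
        V = np.zeros(self.num_states)
        while True:
            Q = self.R + self.gamma * (self.T @ V)   # shape (S, A)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new
```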

A state abstraction is an aggregation function that projects the environmental state space into a smaller one. With a smaller state space, algorithms can learn with less computation, space, and samples (Singh et al., 1995; Dearden and Boutilier, 1997; Dietterich, 2000b,a; Andre and Russell, 2002; Jong and Stone, 2005). However, throwing away information about the state space can restrict an agent's ability to represent good policies; as a trivial illustration of this, consider the maximally-compressing state abstraction that maps all original states in the environment to a single abstract state. Only well-chosen types of state abstraction are known to preserve representation of good policies (Dean and Givan, 1997; Li et al., 2006; Van Roy, 2006; Hutter, 2014; Abel et al., 2016, 2019; Majeed and Hutter, 2019). We define a state abstraction as follows:

Definition 1 (State Abstraction): A state abstraction φ : S → S_φ maps each ground state, s ∈ S, into an abstract state, s_φ ∈ S_φ.

Since the abstract state space is smaller than the original MDP's state space, we expect to lose information. In the context of sequential decision making, it is natural to permit this information loss so long as RL algorithms can still learn to make useful decisions (Abel et al., 2019; Harutyunyan et al., 2019).
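For example, in a grid world a state abstraction can be as simple as a function of the agent's position. The sketch below maps each (x, y) ground state of a hypothetical 11×11 Four Rooms grid to one of four abstract room states; the wall locations and function name are assumptions made for illustration.

```python
def four_rooms_phi(state):
    """Map a ground state (x, y) in a hypothetical 11x11 Four Rooms grid
    to one of four abstract states, one per room.

    This is an illustrative phi, not the construction used in the paper's
    experiments.
    """
    x, y = state
    left, bottom = x <= 5, y <= 5   # assumed wall locations
    if left and bottom:
        return 0   # bottom-left room
    if left:
        return 1   # top-left room
    if bottom:
        return 2   # bottom-right room
    return 3       # top-right room
```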

Next, we recall options, a popular formalism for organizing the action space of an agent.

Definition 2 (Option (Sutton et al., 1999)): An option o ∈ O is a triple (I_o, β_o, π_o), where I_o ⊆ S is a subset of the state space denoting where the option initiates; β_o ⊆ S is a subset of the state space denoting where the option terminates; and π_o : S → Δ(A) is a policy prescribed by the option o.

Options denote abstract actions; the three components indicate where the option o can be executed (I_o), where it terminates (β_o), and what to do in between these two conditions (π_o). Options are known to aid in transfer (Konidaris and Barto, 2007, 2009; Brunskill and Li, 2014; Topin et al., 2015), encourage better exploration (Simsek and Barto, 2004, 2009; Brunskill and Li, 2014; Bacon et al., 2017; Fruit and Lazaric, 2017; Machado et al., 2017; Tiwari and Thomas, 2019; Jinnai et al., 2019), and make planning more efficient (Mann and Mannor, 2014; Mann et al., 2015). We define an action abstraction in terms of options as follows:

Definition 3 (Action Abstraction): An action abstraction is any replacement of the primitive actions, A, with a set of options O.

Thus, an RL algorithm paired with an action abstraction chooses from among the available options at each time step, runs the option until termination, and then chooses the next option. With O replacing the primitive action space,¹ we are no longer guaranteed to be able to represent every policy and may destroy an agent's ability to discover a near-optimal policy. If the primitive actions are included, any policy that can be represented in the original state-action space can still be represented. However, by including both options and primitive actions, learning algorithms face a larger branching factor and must search the full space of policies, which can hurt learning performance (Jong et al., 2008). Hence, it is often prudent to restrict the action space to a set of options alone to avoid blowing up the search space. Ideally, we would make use of options that 1) preserve representation of good policies, while 2) keeping the branching factor and induced policy class small. Naturally, options that always execute the optimal policy maximally satisfy both properties, but such options are challenging to obtain. Establishing a clear understanding of how the first property can be satisfied under approximate knowledge is the primary motivation of this work.
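To make this execution model concrete, the sketch below rolls out an agent that picks an option, follows its policy until a termination state is reached, and only then picks again. The `Option` container and the gym-style `env.reset()`/`env.step()` interface are assumptions for illustration, not part of the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    """An option o = (I_o, beta_o, pi_o) over a discrete state space."""
    initiation: Set[Any]              # I_o: states where o may start
    termination: Set[Any]             # beta_o: states where o stops
    policy: Callable[[Any], Any]      # pi_o: state -> primitive action

def run_with_options(env, options, choose, num_episodes=10, max_steps=500):
    """Roll out episodes where decisions are made only at option boundaries.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done). `choose(state, available)`
    selects among options whose initiation set contains `state`.
    """
    for _ in range(num_episodes):
        state, done, steps = env.reset(), False, 0
        while not done and steps < max_steps:
            available = [o for o in options if state in o.initiation]
            if not available:          # no option initiates here; stop the episode
                break
            option = choose(state, available)
            # Follow the option's policy until it terminates (or episode ends).
            while not done and steps < max_steps:
                state, reward, done = env.step(option.policy(state))
                steps += 1
                if state in option.termination:
                    break
```

Decisions are thus made on the slower timescale of option boundaries, which is exactly the setting analyzed in the remainder of the paper.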

2 RELATED WORK

The study of abstraction in RL has a long and exciting history, dating back to early work on approximating dynamic programs by Fox (1973) and Whitt (1978, 1979), along with the early work on hierarchical RL (Dayan and Hinton, 1993; Wiering and Schmidhuber, 1997; Parr and Russell, 1998; Dietterich, 2000a), options (Sutton et al., 1999), and state abstraction (Tsitsiklis and Van Roy, 1996; Dean and Givan, 1997; Andre and Russell, 2002; Li et al., 2006). This literature is vast; we here concentrate on work that is focused on abstractions that aim to retain representation of near-optimal behavior. We divide this summary into each of state, action, joint state-action, and hierarchical abstraction.

¹ In this work, we are uninterested in degenerate options (Bacon et al., 2017; Harb et al., 2018) that alias individual primitive actions (I_o = β_o = S, ∃a : π_o(a | s) = 1).

State Abstraction. The work of Whitt (1978, 1979) paved the way for understanding the value loss of state abstraction in MDPs. Later, Dean and Givan (1997) developed a method for finding states that behave similarly to one another via the bisimulation property (Larsen and Skou, 1991). Many subsequent works have explored bisimulation for abstraction, including defining metrics for finite (Ferns et al., 2004; Castro and Precup, 2011; Castro, 2019) and infinite MDPs (Ferns et al., 2006; Taylor et al., 2008). In a similar vein, Li et al. (2006) provide a unifying framework for state abstraction in MDPs, examining when such abstractions will preserve optimal behavior and affect existing convergence guarantees of well-known RL algorithms. Recent work has continued to clarify the conditions under which state abstractions preserve value in MDPs (Jiang et al., 2015a; Abel et al., 2016), non-Markovian environments (Hutter, 2014), planning algorithms (Hostetler et al., 2014; Jiang et al., 2014; Anand et al., 2015), and lifelong RL (Abel et al., 2018). A related and important body of work studies the problem of selecting a state abstraction from a given class (Maillard et al., 2013; Odalric-Ambrym et al., 2013; Jiang et al., 2015a; Ortner et al., 2019).

Action Abstraction. Little is known about which kinds of options or action abstractions preserve value; this stems largely from the fact that most prior studies of options consider the setting where options are added to the primitive actions, so A_abstract = A ∪ O. Since the primitive actions are included, the value loss is always zero, as the full space of policies is still representable. Brunskill and Li (2014) study the problem of learning options that improve sample efficiency in lifelong RL. Under mild assumptions, their algorithm finds options that will match some default sample complexity (achieved without options). However, in order to realize the benefits of options, it is important to consider the removal of primitive actions; otherwise, RL algorithms can still represent the full space of policies, and so learning speed may suffer (Jong et al., 2008).

A few previous works study value preservation in the case where options replace the primitive actions. Most recently, Mann and Mannor (2014) and Mann et al. (2015) investigate the effect of options on approximate planning algorithms. Their analysis considers options that use ε-optimal policies and are guaranteed to run for a large number of time steps, ensuring that few mistakes are made by each option. The key result of Mann et al. (2015) characterizes the convergence rate of value iteration using these kinds of options, with this rate depending directly on the sub-optimality and length of the options. Lehnert et al. (2018) explore the impact of horizon length on the representation of value functions. They find that any policy that optimizes with respect to an artificially short horizon can achieve value similar to that of policies that take into account the full horizon, building on the results of Jiang et al. (2015b). Similar analysis is conducted in the case of well-connected subsets of states, paralleling our treatment of state abstractions here.

State-Action Abstraction. Together, state and action abstractions can distill complex problems into simple ones (Jonsson and Barto, 2001; Ciosek and Silver, 2015). One popular approach builds around MDP homomorphisms (Ravindran and Barto, 2002, 2003a,b, 2004; Ravindran, 2003). MDP homomorphisms compress the MDP by collapsing state-action pairs that can be treated as equivalent. The recent work by Majeed and Hutter (2019) extends this analysis to non-Markovian settings, proving the existence of several classes of value-preserving homomorphisms. The key difference between their results and the analysis we present here is our close attachment to the options formalism. Indeed, as we will show, our framework for state-action abstractions can capture MDP homomorphisms in a sense.

Separately, Bai and Russell (2017) develop a Monte Carlo planning algorithm that incorporates state and action abstractions to efficiently solve the Partially Observable MDP (Kaelbling et al., 1998) induced by the abstractions. Theorem 1 from Bai and Russell (2017) shows that the value loss of their approach is bounded as a function of the state aggregation error (Hostetler et al., 2014), and Theorem 2 shows their algorithm converges to a recursively optimal policy for given state-action abstractions. Our results are closely related, but capture broad classes of state-action abstraction (Theorem 1) and target global optimality, rather than recursive optimality (Theorem 3).

Hierarchical Abstraction. Just as state and action abstractions enable a coarser view of a decision-making problem through a single lens, many approaches accommodate abstraction at multiple levels of granularity (Dayan and Hinton, 1993; Wiering and Schmidhuber, 1997; Parr and Russell, 1998; Dietterich, 2000a; Barto and Mahadevan, 2003; Jong and Stone, 2008; Konidaris, 2016; Bai and Russell, 2017; Konidaris et al., 2018; Levy et al., 2019). Most recently, Nachum et al. (2019) introduce a scheme for preserving near-optimal behavior in hierarchical abstractions. Their main result studies the kinds of multi-step state visitation distributions induced by different hierarchies. In particular, they show that hierarchies that can induce a sufficiently rich space of k-step state distributions can represent near-optimal behavior, too. Indeed, this represents the strongest existing result for how to construct value-preserving hierarchies based on approximate knowledge.

3 ANALYSIS: φ-RELATIVE OPTIONS

We incorporate state and action abstraction into RL as follows. When the environment transitions to a new state s, the agent processes s via φ, yielding the abstract state s_φ. Then, the agent chooses an option from among those that initiate in s_φ and follows the chosen option's policy until termination, where this process repeats. In this way, an RL agent can reason in terms of abstract states and actions alone, without knowing the true state or action space.

To analyze the value loss of these joint abstractions, we first introduce φ-relative options, a novel means of combining state abstractions with options.

Definition 4 (φ-Relative Option): For a given φ, an option is said to be φ-relative if and only if there is some s_φ ∈ S_φ such that, for all s ∈ S:

$$I_o(s) \equiv s \in s_\phi, \qquad \beta_o(s) \equiv s \notin s_\phi, \qquad \pi_o \in \Pi_{s_\phi}, \quad (1)$$

where Π_{s_φ} : {s | φ(s) = s_φ} → Δ(A) is the set of ground policies defined over states in s_φ, and s ∈ s_φ is shorthand for s ∈ {s′ ∈ S | φ(s′) = s_φ}. We denote O_φ as any non-empty set that 1) contains only φ-relative options, and 2) contains at least one option that initiates in each s_φ ∈ S_φ.

Intuitively, these options initiate in exactly one abstract state and terminate when the option policy leaves that abstract state. We henceforth denote (φ, O_φ) as a state abstraction paired with a set of φ-relative options.
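A minimal sketch of Definition 4: given φ and, for each abstract state, a list of candidate ground policies over that block, build options whose initiation set is exactly the block and whose termination condition is leaving it. The helper name and the dict-based option representation are assumptions made for illustration.

```python
def make_phi_relative_options(states, phi, block_policies):
    """Build a set O_phi of phi-relative options, one per block policy.

    `states` is the ground state space, `phi` a state abstraction, and
    `block_policies[s_phi]` a list of ground policies defined over the block
    {s : phi(s) = s_phi}. Each option initiates exactly in its block and
    terminates as soon as the agent leaves that block (Definition 4).
    Options are represented as plain dicts purely for illustration.
    """
    options = []
    for s_phi, policies in block_policies.items():
        block = {s for s in states if phi(s) == s_phi}
        for pi in policies:
            options.append({
                "abstract_state": s_phi,
                "initiation": block,                            # I_o(s): s in s_phi
                "terminates": (lambda s, b=block: s not in b),  # beta_o(s): s not in s_phi
                "policy": pi,                                   # pi_o in Pi_{s_phi}
            })
    return options
```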

Example. Consider the classical Four Rooms domain pictured in Figure 1a. Suppose further that the state abstraction φ turns each room into an abstract state. Then any φ-relative option in this domain would be one that initiates anywhere in one of the rooms and terminates as soon as the agent leaves that room. The only degree of flexibility in grounding a set of φ-relative options for the given φ, then, is which policies are associated with each option, and how many options are available in each abstract state. If, for instance, the optimal policy π* were chosen for an option in the top right room, but the uniform random policy were available everywhere else, how might that impact the overall suboptimality of the policies induced by the abstraction? Our main result (Theorem 1) clarifies the precise conditions under which ε-optimal policies are representable under different abstractions.

[Figure 1: Grounding policy π_{O_φ} to π⁺_{O_φ}. (a) Assignment of options to each s_φ via π_{O_φ}. (b) Construction of π⁺_{O_φ}.]

To analyze the value loss of these pairs, we first show that any (φ, O_φ) gives rise to an abstract policy over S_φ and O_φ that induces a unique policy in the original MDP (over the entire state space). Critically, this property does not hold for arbitrary options due to their semi-Markovian nature.

Proofs of all introduced Theorems and Remarks are presented in Appendix A.

Remark 1. Every deterministic policy defined over abstract states and φ-relative options, π_{O_φ} : S_φ → O_φ, induces a unique Markov policy in the ground MDP, π⁺_{O_φ} : S → Δ(A). We let Π_{O_φ} denote the set of abstract policies representable by the pair (φ, O_φ), and Π⁺_{O_φ} denote the corresponding set of policies in the original MDP.

This remark gives us a means of translating a policy over φ-relative options into a policy over the original state and action space, S and A. Consequently, we can define the value loss associated with a set of options paired with a state abstraction: every (φ, O_φ) pair yields a set of policies in the original MDP, Π⁺_{O_φ}. We define the value loss of (φ, O_φ) as the value loss of the best policy in this set.
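The grounding in Remark 1 (and Figure 1) can be written directly: the induced ground policy looks up which option the abstract policy assigns to φ(s) and executes that option's policy at s. A small sketch, with a dict-based representation assumed for illustration:

```python
def ground_policy(phi, abstract_policy, options_by_abstract_state):
    """Turn an abstract policy pi_{O_phi}: S_phi -> O_phi into the unique
    ground policy pi^+_{O_phi}: S -> A it induces (Remark 1).

    `abstract_policy[s_phi]` names the option chosen in block s_phi, and
    `options_by_abstract_state[s_phi][name]` is that option's ground policy.
    The dict-based representation is an assumption for clarity.
    """
    def pi_plus(s):
        s_phi = phi(s)
        chosen = abstract_policy[s_phi]
        option_policy = options_by_abstract_state[s_phi][chosen]
        # pi^+(s) = pi_o(s), where o = pi_{O_phi}(phi(s)).
        return option_policy(s)
    return pi_plus
```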


Definition 5 ((φ, O_φ)-Value Loss): The value loss of (φ, O_φ) is the smallest degree of sub-optimality achievable:

$$L(\phi, O_\phi) := \min_{\pi_{O_\phi} \in \Pi_{O_\phi}} \left\lVert V^* - V^{\pi^+_{O_\phi}} \right\rVert_\infty. \quad (2)$$

Note that this notion of value loss is not well defined for options in general, since they induce a semi-MDP: there is no well-formed ground value function of a policy over options, but rather, a semi-Markov value function. As a simple illustration, consider a ground state s_g, two options o_1 and o_2 (either of which could be executing in s_g), and a policy π_{φ,O} over abstract states and options. It could be that o_1 or o_2 is currently executing when s_g is entered, or that either option has just terminated, requiring π_{φ,O} to select a new option. Each of these three cases induces a distinct value V^{π_{φ,O}}(s_g), which is then difficult to distill into a single ground value function. This is a key reason to restrict attention to φ-relative options, each of which retains structure that couples with the corresponding state abstraction φ to yield value functions in the ground MDP.
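For small MDPs, Definition 5 can be evaluated exactly by brute force: enumerate the deterministic abstract policies, ground each one as in Remark 1, evaluate it in the ground MDP, and take the best. The sketch below assumes dense T and R arrays and deterministic option policies given as integer arrays over ground states; it is exponential in |S_φ| and intended only as an executable restatement of Equation 2.

```python
import itertools
import numpy as np

def evaluate_policy(T, R, gamma, pi):
    """Exact policy evaluation for a deterministic ground policy pi: S -> A."""
    n = R.shape[0]
    T_pi = T[np.arange(n), pi]                     # (S, S') transitions under pi
    R_pi = R[np.arange(n), pi]                     # (S,) rewards under pi
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def value_loss(T, R, gamma, phi_of, options):
    """Brute-force evaluation of L(phi, O_phi) in Eq. (2) for tiny MDPs.

    `phi_of[s]` is the abstract state of ground state s; `options[s_phi]` is a
    list of deterministic option policies (integer arrays of shape (S,)).
    """
    # V* for the sup-norm comparison, via a fixed number of Bellman backups.
    V_star = np.zeros(R.shape[0])
    for _ in range(5000):
        V_star = (R + gamma * (T @ V_star)).max(axis=1)

    abstract_states = sorted(set(phi_of))
    best = np.inf
    # Enumerate every deterministic abstract policy (one option index per block).
    for choice in itertools.product(*[range(len(options[a])) for a in abstract_states]):
        pi_plus = np.zeros(len(phi_of), dtype=int)
        for s, s_phi in enumerate(phi_of):
            option = options[s_phi][choice[abstract_states.index(s_phi)]]
            pi_plus[s] = option[s]                 # the grounding of Remark 1
        V_pi = evaluate_policy(T, R, gamma, pi_plus)
        best = min(best, float(np.max(np.abs(V_star - V_pi))))
    return best
```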

3.1 Main Results

We next show how different classes of φ-relative options can represent near-optimal policies. We define each option class by a predicate on sets of φ-relative options: a set O_φ belongs to a class if and only if it satisfies that class's defining predicate.

We begin by summarizing the four new φ-relative option classes, drawing inspiration from other forms of abstraction (Dean and Givan, 1997; Li et al., 2006; Jiang et al., 2015a; Abel et al., 2016; Ravindran and Barto, 2004; Nachum et al., 2019). For each class, we will refer to the optimal option in s_φ, denoted o*_{s_φ}, as the φ-relative option that initiates in s_φ and executes π* until termination. These classes were chosen as they closely parallel existing properties studied in the literature. The four classes are as follows:

1. Similar Q* Functions: In each s_φ, there is at least one option o that has a similar Q* to o*_{s_φ}.

2. Similar Models: In each s_φ, there is at least one option o that has a similar multi-time model (Precup and Sutton, 1998) to o*_{s_φ}.

3. Similar k-Step Distributions: In each s_φ, there is at least one option o that has a similar k-step termination state distribution to o*_{s_φ}, based off the hierarchical construction introduced by Nachum et al. (2019). Loss bounds will only apply to goal-based MDPs.

4. Approximate MDP Homomorphisms: We show that any deterministic π_{O_φ} can encode an MDP homomorphism. The MDP homomorphism option class is defined by a guarantee on the quality of the resulting homomorphism.

Our main result establishes the bounded value loss of pairs (φ, O_φ) where O_φ belongs to any of these four classes, and the size of the bound depends on the degree of approximation (ε_Q; ε_R, ε_T; τ; and ε_r, ε_p).

Theorem 1. (Main Result) For any φ, the four introduced classes of φ-relative options satisfy:

$$L(\phi, O_{\phi, Q^*_\varepsilon}) \le \frac{\varepsilon_Q}{1 - \gamma}, \quad (3)$$

$$L(\phi, O_{\phi, M_\varepsilon}) \le \frac{\varepsilon_R + |S|\, \varepsilon_T R_{\mathrm{Max}}}{(1 - \gamma)^2}, \quad (4)$$

$$L(\phi, O_{\phi, \tau}) \le \frac{\tau \gamma |S|}{(1 - \gamma)^2}, \quad (5)$$

$$L(\phi, O_{\phi, H}) \le \frac{2}{1 - \gamma}\left(\varepsilon_r + \frac{\gamma R_{\mathrm{Max}}}{1 - \gamma} \cdot \frac{\varepsilon_p}{2}\right), \quad (6)$$

where R_Max is an upper bound on the reward function; the L(φ, O_{φ,τ}) bound holds in goal-based MDPs and the other three hold in any finite MDP.

Observe that when the approximation parameters are zero, many of the bounds collapse to 0 as well. This illustrates the trade-off made between the amount of knowledge used to construct the abstractions and the degree of optimality ensured. Further note that the value loss of the state abstraction does not appear in any of the above bounds; indeed, φ will implicitly affect the value loss as a function of the diameter of each abstract state.
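As a quick numerical illustration of how these bounds scale with γ and the approximation parameters, the snippet below simply evaluates the right-hand sides of Equations 3–6 for hypothetical parameter values (it is plain arithmetic on the stated bounds, not a verification of them).

```python
def theorem1_bounds(gamma, n_states, r_max, eps_q, eps_r, eps_t, tau, eps_r_hom, eps_p):
    """Evaluate the right-hand sides of Eqs. (3)-(6) for given parameters."""
    b_q   = eps_q / (1 - gamma)                                        # Eq. (3)
    b_m   = (eps_r + n_states * eps_t * r_max) / (1 - gamma) ** 2      # Eq. (4)
    b_tau = tau * gamma * n_states / (1 - gamma) ** 2                  # Eq. (5)
    b_h   = 2 / (1 - gamma) * (eps_r_hom + gamma * r_max / (1 - gamma) * eps_p / 2)  # Eq. (6)
    return b_q, b_m, b_tau, b_h

# Hypothetical values chosen only to show the scaling.
print(theorem1_bounds(gamma=0.95, n_states=100, r_max=1.0, eps_q=0.01,
                      eps_r=0.01, eps_t=0.001, tau=0.001,
                      eps_r_hom=0.01, eps_p=0.01))
```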

We now present each class in full technical detail. As stated, the first two classes guarantee ε-closeness of values and models, respectively. More concretely:

Similar Q*-Functions (O_{φ,Q*_ε}): The ε-similar Q* predicate defines an option class where:

$$\forall_{s_\phi \in S_\phi} \, \exists_{o \in O_\phi} : \; \max_{s \in s_\phi} \left| Q^*_{s_\phi}(s, o^*_{s_\phi}) - Q^*_{s_\phi}(s, o) \right| \le \varepsilon_Q, \quad (7)$$

where

$$Q^*_{s_\phi}(s, o) := R(s, \pi_o(s)) + \gamma \sum_{s' \in S} T(s' \mid s, \pi_o(s)) \Big( \mathbb{1}(s' \in s_\phi)\, Q^*_{s_\phi}(s', o) + \mathbb{1}(s' \notin s_\phi)\, V^*(s') \Big). \quad (8)$$

This Q-function describes the expected return of starting in state s, executing a φ-relative option o until leaving φ(s), then following the optimal policy thereafter.


More generally, this class of (φ, O_φ) pairs captures all cases where each abstract state has at least one option that is useful. Note that the identity state abstraction paired with the degenerate set of options that exactly encodes the execution of each primitive action will necessarily be an instance of this class.
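Equation 8 can be computed with a small fixed-point iteration restricted to the block: inside s_φ the option keeps control, and on the first exit the continuation value is V*(s'). A sketch assuming dense T and R arrays, a precomputed V*, and a deterministic option policy:

```python
import numpy as np

def q_star_block(T, R, gamma, V_star, block, pi_o, iters=1000):
    """Fixed-point iteration for Q*_{s_phi}(s, o) of Eq. (8).

    `block` is the set of ground states in s_phi, `pi_o[s]` the option's
    (deterministic) action at s, and `V_star` the optimal ground values.
    Returns a dict mapping each s in the block to Q*_{s_phi}(s, o).
    """
    q = {s: 0.0 for s in block}
    for _ in range(iters):
        new_q = {}
        for s in block:
            a = pi_o[s]
            backup = R[s, a]
            for s_next, p in enumerate(T[s, a]):
                if p == 0.0:
                    continue
                # Keep bootstrapping while inside the block; use V* once outside.
                cont = q[s_next] if s_next in block else V_star[s_next]
                backup += gamma * p * cont
            new_q[s] = backup
        q = new_q
    return q
```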

Similar Models (O_{φ,M_ε}): The ε-similar T and R predicate defines an option class where:

$$\forall_{s_\phi \in S_\phi} \, \exists_{o \in O_\phi} : \; \left\lVert T^{s'}_{s, o^*_{s_\phi}} - T^{s'}_{s, o} \right\rVert_\infty \le \varepsilon_T \;\text{ and }\; \left\lVert R_{s, o^*_{s_\phi}} - R_{s, o} \right\rVert_\infty \le \varepsilon_R, \quad (9)$$

where R_{s,o} and T^{s'}_{s,o} are shorthand for the reward model and multi-time model of Sutton et al. (1999). Roughly, this class states that there is at least one option in each abstract state that behaves similarly to the optimal option in that abstract state, o*_{s_φ}, throughout its execution in the abstract state.
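The reward model R_{s,o} and multi-time model T^{s'}_{s,o} referenced above can be approximated for a φ-relative option by accumulating discounted reward and discounted exit probabilities while the option remains inside its block. The sketch below assumes dense arrays and a deterministic option policy; it is an illustration, not the authors' construction.

```python
import numpy as np

def multi_time_model(T, R, gamma, block, pi_o, iters=1000):
    """Approximate the multi-time model (Sutton et al., 1999) of a
    phi-relative option: R_{s,o} and T^{s'}_{s,o} for each s in `block`.

    The option follows the deterministic policy pi_o inside `block` and
    terminates on the first step that leaves it. T has shape (S, A, S'),
    R has shape (S, A).
    """
    n = T.shape[0]
    block = sorted(block)
    in_block = np.zeros(n, dtype=bool)
    in_block[block] = True
    R_mt = {s: 0.0 for s in block}            # discounted reward until exit
    T_mt = {s: np.zeros(n) for s in block}    # discounted exit-state probabilities
    for _ in range(iters):
        new_R, new_T = {}, {}
        for s in block:
            a = pi_o[s]
            p = T[s, a]                        # next-state distribution
            new_R[s] = R[s, a] + gamma * sum(p[s2] * R_mt[s2] for s2 in block)
            exit_mass = gamma * p * (~in_block)           # terminate immediately outside
            cont = gamma * sum(p[s2] * T_mt[s2] for s2 in block)
            new_T[s] = exit_mass + cont
        R_mt, T_mt = new_R, new_T
    return R_mt, T_mt
```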

We next derive two classes of φ-relative options based on abstraction formalisms from existing literature. The first is based on the hierarchical construction introduced by Nachum et al. (2019), while the second shows that φ-relative options can describe an MDP homomorphism (Ravindran and Barto, 2004).

Similar k-Step Distributions (O_{φ,τ}): Let P(s′, k | s, o) denote the probability of option o terminating in s′ after k steps, given that it initiated in s. We define this class by the following predicate:

$$\forall_{s_\phi \in S_\phi} \, \exists_{o \in O_\phi} \, \forall_k : \; \max_{s \in s_\phi, \, s' \in S} \left| P(s', k \mid s, o^*_{s_\phi}) - P(s', k \mid s, o) \right| \le \tau. \quad (10)$$
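The termination distributions P(s′, k | s, o) used in this predicate can be computed for a φ-relative option by propagating probability mass inside the block and recording the mass that exits at each step; a sketch under the same dense-array assumptions as before:

```python
import numpy as np

def k_step_termination(T, block, pi_o, s0, k_max):
    """Compute P(s', k | s0, o) for k = 1..k_max for a phi-relative option.

    The option runs pi_o while inside `block` and terminates the first time
    it leaves. Returns an array of shape (k_max + 1, S) whose [k, s'] entry
    is the probability of terminating in s' at exactly step k.
    """
    n = T.shape[0]
    in_block = np.zeros(n, dtype=bool)
    in_block[list(block)] = True
    alive = np.zeros(n)                         # mass still inside the block
    alive[s0] = 1.0
    out = np.zeros((k_max + 1, n))
    for k in range(1, k_max + 1):
        nxt = np.zeros(n)
        for s in np.nonzero(alive)[0]:
            nxt += alive[s] * T[s, pi_o[s]]     # one step of the option policy
        out[k] = nxt * (~in_block)              # mass that just exited terminates here
        alive = nxt * in_block                  # remaining mass keeps running
    return out
```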

For the next class definition, we first define the one-step abstract transition and reward functions for a φ-relative option o:

$$T_\phi(s'_\phi \mid s_\phi, o) = \sum_{s \in s_\phi} w(s) \sum_{s' \in s'_\phi} T(s' \mid s, \pi_o(s)), \quad (11)$$

$$R_\phi(s_\phi, o) = \sum_{s \in s_\phi} w(s)\, R(s, \pi_o(s)), \quad (12)$$

where w(s) is any valid weighting function with $\sum_{s \in s_\phi} w(s) = 1$. Next, we define the quantities of Ravindran and Barto (2004), related to the simulation lemma of Kearns and Singh (2002):

$$K_p = \max_{s \in S, a \in A} \sum_{s_\phi \in S_\phi} \Big| \sum_{s' \in s_\phi} T(s' \mid s, a) - T_\phi\big(s_\phi \mid \phi(s), \pi_{O_\phi}(\phi(s))\big) \Big|,$$

$$K_r = \max_{s \in S, a \in A} \big| R(s, a) - R_\phi\big(\phi(s), \pi_{O_\phi}(\phi(s))\big) \big|. \quad (13)$$

These capture the maximum discrepancy between the model of the ground MDP and the model of the induced abstract MDP defined according to (φ, O_φ). Then, we define the class as follows.

Approximate MDP Homomorphisms (O_{φ,H}): This class requires that O_φ represents policies that induce good approximate homomorphisms:

$$\forall_{\pi_{O_\phi} \in \Pi_{O_\phi}} : \; K_p \le \varepsilon_p \;\text{ and }\; K_r \le \varepsilon_r.$$
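Equations 11–13 translate directly into code: aggregate the ground model with the weights w to obtain (T_φ, R_φ), then measure the worst-case discrepancies over ground state-action pairs. The sketch below assumes dense arrays, a uniform weighting w, and a deterministic abstract policy supplying one option policy per abstract state.

```python
import numpy as np

def abstract_model(T, R, phi_of, n_abstract, option_policy_of):
    """Compute T_phi and R_phi of Eqs. (11)-(12) with uniform weights w."""
    n = T.shape[0]
    T_phi = np.zeros((n_abstract, n_abstract))
    R_phi = np.zeros(n_abstract)
    for s_phi in range(n_abstract):
        block = [s for s in range(n) if phi_of[s] == s_phi]
        w = 1.0 / len(block)                   # uniform weighting, an assumption
        pi_o = option_policy_of[s_phi]         # the option chosen in this block
        for s in block:
            a = pi_o[s]
            R_phi[s_phi] += w * R[s, a]
            for s_next in range(n):
                T_phi[s_phi, phi_of[s_next]] += w * T[s, a, s_next]
    return T_phi, R_phi

def homomorphism_errors(T, R, phi_of, T_phi, R_phi):
    """Compute K_p and K_r of Eq. (13) for the given (phi, pi_{O_phi})."""
    n, n_actions = R.shape
    n_abstract = T_phi.shape[0]
    K_p, K_r = 0.0, 0.0
    for s in range(n):
        s_phi = phi_of[s]
        for a in range(n_actions):
            # Aggregate the ground next-state distribution into abstract states.
            to_abstract = np.zeros(n_abstract)
            for s_next in range(n):
                to_abstract[phi_of[s_next]] += T[s, a, s_next]
            K_p = max(K_p, float(np.abs(to_abstract - T_phi[s_phi]).sum()))
            K_r = max(K_r, abs(R[s, a] - R_phi[s_phi]))
    return K_p, K_r
```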

These four classes constitute four sufficient conditions for (φ, O_φ) pairs to yield bounded value loss. It is useful, however, to identify not just sufficient conditions, but also necessary ones. To this end, we next establish one necessary condition satisfied by all (globally) value-preserving (φ, O_φ) classes.

Theorem 2. For any (φ, O_φ) pair with L(φ, O_φ) ≤ η, there exists at least one option per abstract state that is η-optimal in Q-value. Precisely, if L(φ, O_φ) ≤ η, then:

$$\forall_{s_\phi \in S_\phi} \, \forall_{s \in s_\phi} \, \exists_{o \in O_\phi} : \; Q^*_{s_\phi}(s, o^*_{s_\phi}) - Q^*_{s_\phi}(s, o) \le \eta. \quad (14)$$

This theorem tells us that for any agent acting using our joint abstraction framework, if there exists an abstract state for which there is not an η-optimal option, then the agent cannot represent a globally near-optimal policy.

3.2 Experiment

We next conduct a simple experiment to corroborate the findings of our analysis, and to test whether value-preserving options enable simple RL algorithms to find near-optimal policies. The experiment illustrates an important property of one of the introduced (φ, O_φ) classes, and is organized as follows. We first construct a (φ, O_φ) pair belonging to the O_{φ,Q*_ε} class using dynamic programming. We give this pair to an instance of Double Q-Learning (Hasselt, 2010) with a varied sample budget N. The environment is a Four Rooms grid world MDP, with a single goal location in the top right and start location in the bottom left. The state abstraction φ maps each state into one of four abstract states, denoting each of the four rooms. We vary both the number of options added per abstract state (|O_φ|) and the sample budget of Double Q (N), and present the value of the policy discovered by the final episode for each setting of |O_φ| and N.

Results are presented in Figure 2a. First, note that with only one option per abstract state, Double Q can trivially find a near-optimal policy, even with a small sample budget. This aligns with our theory: the included options preserve value, and so any assignment of options to abstract states will yield a near-optimal policy. In contrast, if randomly chosen options are used instead (shown in blue, labeled as O_random), the learning algorithm fails to find a good policy even with a high sample budget (N = 1e6 was used). Second, we find that as the number of options increases, the added branching factor causes Double Q to find a lower-value policy with the same number of samples. However, by Theorem 1 we know each set of options preserves value; as the sample budget increases, we see that the value of the discovered policy tends toward optimal. In short, the (φ, O_φ) pairs defined by Theorem 1 do in fact preserve value, but will also affect the sample budget required to find a good policy. We foresee the combination of value-preserving abstractions with those that lower learning complexity (see recent work by Brunskill and Li (2014); Fruit et al. (2017)) as a key direction for future work.

[Figure 2: Empirical evidence that the (φ, O_φ) pairs from Theorem 1 preserve value (left) and comparison of the learned value function with Q-Learning run in the ground MDP and run in the abstract MDP of (φ, O_φ) (right). (a) Learning with value-preserving options: the y-axis denotes the value of the policy learned by Double Q-Learning for the given sample budget (N) and option set, averaged over 25 runs with 95% confidence intervals; optimal behavior is shown in red. (b) Learned value functions for Q-Learning (left) and Q-Learning with (φ, O_φ) (right): stars indicate potential goal locations (those used for constructing the options), with the red star the currently active one; brighter color indicates a higher estimated value, with purple a value of 0.]

We further visualize the learned value function of Q-Learning with and without φ-relative options after the same sample budget, depicted in Figure 2b. Notably, since φ-relative options update entire blocks of states, we see large regions of the state space with the same learned value function. Conversely, Q-Learning only tends to explore (and estimate the values of) a narrow region of the state space. The visual highlights this important qualitative difference between learning with and without abstractions. More experiments, details, and a link to the code are found in Appendix B.

4 HIERARCHICAL ABSTRACTION

We next present an extension of Theorem 1 that applies to hierarchies consisting of (φ, O_φ) pairs. We show the value loss compounds linearly if we construct a hierarchy using algorithms that generate a well-behaved φ and O_φ. To do so, we require two definitions and additional notation (see Table 1 in Appendix A for a summary of hierarchy notation). We first define a hierarchy as n sets of (φ, O_φ) pairs.

Definition 6 ((φ, O_φ)-Hierarchy): A depth-n hierarchy, denoted H_n, is a list of n state abstractions, φ^(n), and a list of n sets of φ-relative options, O_φ^(n). The components (I, β, π) of each i-th set of options, O_{φ,i}, are defined over the (i−1)-th abstract state space S_{φ,i−1} = {φ_{i−1}(φ_{i−2}(. . . φ_1(s) . . .)) | s ∈ S}.

We next introduce additional notation to refer to values, states, options, and policies at each level of the hierarchy. We denote π_n : S_{φ,n} → O_{φ,n} as the level-n policy encoded by the hierarchy, with Π_n the space of all policies encoded in this way. We let s_i := φ_i(s) = φ_i(. . . φ_1(s) . . .), with s a state in the ground MDP. We further denote V_i as the i-th level's value function, defined as follows for some ground state s:

$$V^\pi_i(s) := V^\pi_i\big(\phi_i(s)\big) = V^\pi_i(s_i) = \max_{o \in O_i} \left( R_i(s_i, o) + \sum_{s' \in S_i} T_i(s' \mid s_i, o)\, V^\pi_i(s') \right), \quad (15)$$

where:

$$R_i(s_i, o) := \sum_{s_{i-1} \in s_i} w_i(s_{i-1})\, R_{s_{i-1}, o}, \quad (16)$$

$$T_i(s'_i \mid s_i, o) := \sum_{s_{i-1} \in s_i} \sum_{s'_{i-1} \in s'_i} w_i(s_{i-1})\, T^{s'_{i-1}}_{s_{i-1}, o}, \quad (17)$$

where again R_{s,o} and T^{s'}_{s,o} are defined according to the multi-time model (Sutton et al., 1999), s_i ∈ S_{φ,i} is a level-i state resulting from φ_i(s), and w_i is an aggregation weighting function for level i. Note that V_0 is the ground value function, which we refer to as V for simplicity.

4.1 Hierarchy Analysis

We now extend Theorem 1 to hierarchies of arbitrary depth, building on two key observations. First, any policy π_n represented at the top level of a hierarchy H_n also has a unique Markov policy in the ground MDP, which we denote π⁺_n (in contrast to π↓_n, which moves the level-n policy to level n−1). We summarize this fact in the following remark:

Remark 2. Every deterministic policy π_i defined by the i-th level of a hierarchy, H_n, induces a unique policy in the ground MDP, which we denote π⁺_i.

To be precise, note that π↓_i specifies the level-i policy π_i mapped into level i−1, whereas π⁺_i refers to the policy at level i mapped into level 0. The second key insight is that we can extend our same notion of value loss from (φ, O_φ) pairs to hierarchies, H_n.

Definition 7 (H_n-Value Loss): The value loss of a depth-n hierarchy H_n is the smallest degree of suboptimality across all policies representable at the top level of the hierarchy:

$$L(H_n) := \min_{\pi_n \in \Pi_n} \left\lVert V^* - V^{\pi^+_n} \right\rVert_\infty. \quad (18)$$

This quantity denotes how suboptimal the best hierarchical policy is in the ground MDP. Therefore, the guarantee we present expresses a condition on global optimality rather than recursive or hierarchical optimality (Dietterich, 2000a).

We next show that there exist value-preserving hierarchies by bounding the above quantity for well-constructed hierarchies. To prove this result, we require two assumptions.

Assumption 1. The value function is consistent throughout the hierarchy. That is, for every level of the hierarchy i ∈ [1 : n], and for any policy π_i over states S_{φ,i} and options O_{φ,i}, there is a small κ ∈ ℝ_{≥0} such that:

$$\max_{s \in S} \left| V^{\pi^\downarrow_i}_{i-1}\big(\phi_{i-1}(s)\big) - V^{\pi_i}_i\big(\phi_i(s)\big) \right| \le \kappa. \quad (19)$$

Assumption 2. Subsequent levels of the hierarchy can represent policies similar in value to the best policy at the previous level. That is, for every i ∈ [1 : n − 1], letting $\pi^\Pi_i = \arg\min_{\pi_i \in \Pi_i} \lVert V^*_0 - V^{\pi^+_i}_0 \rVert_\infty$, there is a small ℓ ∈ ℝ_{≥0} such that:

$$\min_{\pi^\downarrow_{i+1} \in \Pi^\downarrow_{i+1}} \left\lVert V^{\pi^\Pi_i}_i - V^{\pi^\downarrow_{i+1}}_i \right\rVert_\infty \le \ell. \quad (20)$$

We strongly suspect that both assumptions are true given the right choice of state abstractions, options, and methods of constructing abstract MDPs. These two assumptions (along with Theorem 1) give rise to hierarchies that can represent near-optimal behavior. We present this fact through the following theorem:

Theorem 3. Consider two algorithms: 1) A_φ: given an MDP M, outputs a φ; and 2) A_{O_φ}: given M and a φ, outputs a set of options O such that there are constants κ and ℓ for which Assumption 1 and Assumption 2 are satisfied.

Then, by repeated application of A_φ and A_{O_φ}, we can construct a hierarchy of depth n such that

$$L(H_n) \le n(\kappa + \ell). \quad (21)$$

This theorem offers a clear path for extending the guarantees of φ-relative options beyond the typical two-timescale setup observed in recent work (Bacon et al., 2017; Nachum et al., 2019) to fully realize the benefits of (multi-level) hierarchical abstraction (Levy et al., 2019). Moreover, both Assumption 1 and Assumption 2 are sufficient, together with φ-relative options that satisfy Theorem 1, to construct a hierarchy with low value loss. We conclude that algorithms for leveraging hierarchies may want to explicitly search for structures that satisfy our assumptions: 1) value function smoothness up and down the hierarchy, and 2) policy richness at each level of the hierarchy.
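Theorem 3 suggests a simple recipe, sketched below with placeholder algorithms `build_phi` and `build_options` standing in for A_φ and A_{O_φ} (all names are hypothetical): repeatedly abstract the current MDP, form the induced abstract MDP over abstract states and options, and stack the levels into a hierarchy.

```python
def build_hierarchy(ground_mdp, build_phi, build_options, make_abstract_mdp, depth):
    """Construct a depth-n hierarchy by repeated application of A_phi and
    A_{O_phi}, in the spirit of Theorem 3.

    `build_phi(M)` plays the role of A_phi, `build_options(M, phi)` the role
    of A_{O_phi}, and `make_abstract_mdp(M, phi, options)` forms the induced
    abstract MDP (e.g., via Eqs. 11-12). All three are assumed to be supplied
    by the caller; this sketch only shows how the levels compose.
    """
    hierarchy = []            # list of (phi_i, O_phi_i) pairs, bottom to top
    mdp = ground_mdp
    for _ in range(depth):
        phi = build_phi(mdp)                        # A_phi
        options = build_options(mdp, phi)           # A_{O_phi}
        hierarchy.append((phi, options))
        mdp = make_abstract_mdp(mdp, phi, options)  # next level's "ground" MDP
    return hierarchy
```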

5 DISCUSSION

We proved which state-action abstractions are guaranteed to preserve representation of high-value policies. To do so, we introduced φ-relative options, a simple but expressive formalism for combining state abstractions with options. Under this formalism, we proposed four classes of φ-relative options with bounded value loss. Lastly, we proved that, under mild conditions, pairs of state-action abstractions can be recursively combined to induce hierarchies that also possess near-optimality guarantees.

We take these results to serve as a concrete path toward principled abstraction discovery and use in RL. To realize this goal, abstractions need to both preserve good behavior and lower sample complexity. Our results thus far focus on the representation of near-optimal policies, which is a key condition for ensuring agents can eventually behave well. However, learning agents also need to represent functions of intermediate quality as they learn. Thus, an important direction for future work is to clarify which kinds of abstractions ensure that near-optimal policies are easily learnable.


Acknowledgements

We would like to thank Cam Allen, Philip Amortila, Will Dabney, Mark Ho, George Konidaris, Ofir Nachum, Scott Niekum, Silviu Pitis, and Balaraman Ravindran for insightful discussions, and the anonymous reviewers for their thoughtful feedback. This work was supported by a grant from the DARPA L2M program and an ONR MURI grant.

References

David Abel, D. Ellis Hershkowitz, and Michael L. Littman. Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning, 2016.

David Abel, Dilip Arumugam, Lucas Lehnert, and Michael L. Littman. State abstractions for lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.

David Abel, Dilip Arumugam, Kavosh Asadi, Yuu Jinnai, Michael L. Littman, and Lawson L.S. Wong. State abstraction as compression in apprenticeship learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

Ankit Anand, Aditya Grover, Parag Singla, et al. A novel abstraction framework for online planning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2015.

David Andre and Stuart Russell. State abstraction for programmable reinforcement learning agents. In Proceedings of the AAAI Conference on Artificial Intelligence, 2002.

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

Aijun Bai and Stuart Russell. Efficient reinforcement learning with hierarchies of machines by leveraging internal transitions. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.

Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.

Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957.

Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2014.

Pablo Samuel Castro. Scalable methods for computing state similarity in deterministic Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

Pablo Samuel Castro and Doina Precup. Automatic construction of temporally extended actions for MDPs using bisimulation metrics. In Proceedings of the European Workshop on Reinforcement Learning, 2011.

Kamil Ciosek and David Silver. Value iteration with options and state aggregation. ICAPS Workshop on Planning and Learning, 2015.

Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, 1993.

Thomas Dean and Robert Givan. Model minimization in Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 1997.

Richard Dearden and Craig Boutilier. Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89(1):219–283, 1997.

Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 2000a.

Thomas G. Dietterich. State abstraction in MAXQ hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2000b.

Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.

Norman Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden. Methods for computing state similarity in Markov decision processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2006.

Bennett L. Fox. Discretizing dynamic programs. Journal of Optimization Theory and Applications, 11(3):228–234, 1973.

Ronan Fruit and Alessandro Lazaric. Exploration–exploitation in MDPs with options. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2017.

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. In Advances in Neural Information Processing Systems, 2017.

Robert Givan, Sonia Leach, and Thomas Dean. Bounded parameter Markov decision processes. In European Conference on Planning. Springer, 1997.

Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Remi Munos, and Doina Precup. The termination critic. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2019.

Hado Van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.

Jesse Hostetler, Alan Fern, and Thomas G. Dietterich. State aggregation in MCTS. In Proceedings of the AAAI Conference on Artificial Intelligence, 2014.

Marcus Hutter. Extreme state aggregation beyond MDPs. In Proceedings of the International Conference on Algorithmic Learning Theory, 2014.

Nan Jiang, Satinder Singh, and Richard Lewis. Improving UCT planning via approximate homomorphisms. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems, 2014.

Nan Jiang, Alex Kulesza, and Satinder Singh. Abstraction selection in model-based reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2015a.

Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2015b.

Yuu Jinnai, Jee Won Park, David Abel, and George Konidaris. Discovering options for exploration by minimizing cover time. In Proceedings of the International Conference on Machine Learning, 2019.

Nicholas K. Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In Proceedings of the International Joint Conference on Artificial Intelligence, 2005.

Nicholas K. Jong and Peter Stone. Hierarchical model-based reinforcement learning: R-max + MAXQ. In Proceedings of the International Conference on Machine Learning, 2008.

Nicholas K. Jong, Todd Hester, and Peter Stone. The utility of temporal abstraction in reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2008.

Anders Jonsson and Andrew G. Barto. Automated state abstraction for options using the U-tree algorithm. In Advances in Neural Information Processing Systems, 2001.

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

George Konidaris. Constructing abstraction hierarchies using a skill-symbol loop. In Proceedings of the International Joint Conference on Artificial Intelligence, 2016.

George Konidaris and Andrew G. Barto. Building portable options: Skill transfer in reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, 2007.

George Konidaris and Andrew G. Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, 2009.

George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 2018.

Kim G. Larsen and Arne Skou. Bisimulation through probabilistic testing. Information and Computation, 94(1):1–28, 1991.

Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level hierarchies with hindsight. In Proceedings of the International Conference on Learning Representations, 2019.

Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics, 2006.

Marlos C. Machado, Marc G. Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.

Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, and Daniil Ryabko. Optimal regret bounds for selecting the state representation in reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 543–551, 2013.

Sultan Javed Majeed and Marcus Hutter. Performance guarantees for homomorphisms beyond Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

Timothy A. Mann and Shie Mannor. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the International Conference on Machine Learning, 2014.

Timothy A. Mann, Shie Mannor, and Doina Precup. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research, 2015.

Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation learning for hierarchical reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2019.

Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, and Odalric-Ambrym Maillard. Regret bounds for learning state representations in reinforcement learning. In Advances in Neural Information Processing Systems, 2019.

Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1998.

Doina Precup and Richard S. Sutton. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems, 1998.

Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Balaraman Ravindran. SMDP homomorphisms: An algebraic approach to abstraction in semi-Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2003.

Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pages 196–211. Springer, 2002.

Balaraman Ravindran and Andrew G. Barto. Relativized options: Choosing the right transformation. In Proceedings of the International Conference on Machine Learning, 2003a.

Balaraman Ravindran and Andrew G. Barto. SMDP homomorphisms: An algebraic approach to abstraction in semi-Markov decision processes. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003b.

Balaraman Ravindran and Andrew G. Barto. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. In Proceedings of the International Conference on Knowledge Based Computer Systems, 2004.

Ozgur Simsek and Andrew G. Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.

Ozgur Simsek and Andrew G. Barto. Skill characterization based on betweenness. In Advances in Neural Information Processing Systems, 2009.

Satinder Singh, Tommi Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, 1995.

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

Jonathan Taylor, Doina Precup, and Prakash Panangaden. Bounding performance loss in approximate MDP homomorphisms. In Advances in Neural Information Processing Systems, 2008.

Saket Tiwari and Philip S. Thomas. Natural option critic. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

Nicholay Topin, Nicholas Haltmeyer, Shawn Squire, John Winder, James MacGlashan, et al. Portable option discovery for automated learning transfer in object-oriented Markov decision processes. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.

John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.

Benjamin Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.

Ward Whitt. Approximations of dynamic programs, I. Mathematics of Operations Research, 3(3):231–243, 1978.

Ward Whitt. Approximations of dynamic programs, II. Mathematics of Operations Research, 4(2):179–185, 1979.

Marco Wiering and Jurgen Schmidhuber. HQ-learning. Adaptive Behavior, 6(2):219–246, 1997.