
Reinforcement Learning When All Actions are Not Always Available

Yash Chandak¹, Georgios Theocharous², Blossom Metevier¹, Philip S. Thomas¹

¹University of Massachusetts Amherst, ²Adobe Research
{ychandak,bmetevier,pthomas}@cs.umass.edu, [email protected]

Abstract

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not efficiently capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which better captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs can suffer from potential divergence issues, and present new policy gradient algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches on tasks inspired by real-life use cases wherein the action set is stochastic.

Introduction

In many real-world sequential decision making problems, the set of available decisions, which we call the action set, is stochastic. In vehicular routing on a road network (Gendreau, Laporte, and Seguin 1996) or packet routing on the internet (Ribeiro, Sidiropoulos, and Giannakis 2008), the goal is to find the shortest path between a source and destination. However, due to construction, traffic, or other damage to the network, not all pathways are always available. In online advertising (Tan and Srikant 2012; Mahdian, Nazerzadeh, and Saberi 2007), the set of available ads can vary due to fluctuations in advertising budgets and promotions. In robotics (Feng and Yan 2000), actuators can fail. In recommender systems (Harper and Skiba 2007), the set of possible recommendations can vary based on product availability. These examples capture the broad idea and motivate the question we aim to address: how can we develop efficient learning algorithms for sequential decision making problems wherein the action set can be stochastic?

Sequential decision making problems without stochastic action sets are typically modeled as Markov decision processes (MDPs). Although the MDP formulation is remarkably flexible, and can incorporate concepts like stochastic state transitions, partial observability, and even different (deterministic) action availability depending on the state, it cannot efficiently incorporate stochastic action sets. As a result, algorithms designed for MDPs are not well suited to our setting of interest. Recently, Boutilier et al. (2018) laid the foundations for stochastic action set Markov decision processes (SAS-MDPs), which extend MDPs to include stochastic action sets. They also showed how the Q-learning and value iteration algorithms, two classic algorithms for approximating optimal solutions to MDPs, can be extended to SAS-MDPs.

In this paper we show that the lack of convergence guarantees of the Q-learning algorithm, when using function approximators in the MDP setting, can potentially get exacerbated in the SAS-MDP setting. We therefore derive policy gradient and natural policy gradient algorithms for the SAS-MDP setting and provide conditions for their almost-sure convergence. Critically, since the introduction of stochastic action sets introduces further uncertainty in the decision making process, variance reduction techniques are of increased importance. We therefore derive new approaches to variance reduction for policy gradient algorithms that are unique to the SAS-MDP setting. We validate our new algorithms empirically on tasks inspired by real-world problems with stochastic action sets.

Related Work

While there is extensive literature on solving sequential decision problems modeled as MDPs (Sutton and Barto 2018), there are few methods designed to handle stochastic action sets. Recently, Boutilier et al. (2018) laid the foundation for studying MDPs with stochastic action sets by defining the new SAS-MDP problem formulation, which we review in the background section. After defining SAS-MDPs, Boutilier et al. (2018) presented and analyzed the model-based value iteration and policy iteration algorithms and the model-free Q-learning algorithm for SAS-MDPs.

In the bandit setting, wherein individual decisions are optimized rather than sequences of dependent decisions, sleeping bandits extend the standard bandit problem formulation to allow for stochastic action sets (Kanade, McMahan, and Bryan 2009; Kleinberg, Niculescu-Mizil, and Sharma 2010). We focus on the SAS-MDP formulation rather than the sleeping bandit formulation because we are interested in sequential problems. Such sequential problems are more challenging because making optimal decisions requires one to reason about the long-term impact of decisions, which includes reasoning about how a decision will influence the probability that different actions (decisions) will be available in the future.

Although we focus on the model-free setting, wherein the dynamics of the environment are not known a priori to the agent optimizing its decisions, in the alternative model-based setting researchers have considered related problems in the area of stochastic routing (Papadimitriou and Yannakakis 1991; Polychronopoulos and Tsitsiklis 1996; Nikolova, Brand, and Karger 2006; Nikolova and Karger 2008). In stochastic routing problems, the goal is to find a shortest path on a graph with stochastic availability of edges. The SAS-MDP framework generalizes stochastic routing problems by allowing for sequential decision making problems that are not limited to shortest path problems.

Background

MDPs and SAS-MDPs (Boutilier et al. 2018) are mathematical formulations of sequential decision problems. Before defining SAS-MDPs, we define MDPs. We refer to the entity interacting with an MDP or SAS-MDP and trying to optimize its decisions as the agent.

Formally, an MDP is a tuple M = (S, B, P, R, γ, d0). S is the set of all possible states that the agent can be in, called the state set. Although our mathematical notation assumes that S is countable, our primary results extend to MDPs with continuous states. B is a finite set of all possible actions that the agent can take, called the base action set. St and At are random variables that denote the state of the environment and the action chosen by the agent at time t ∈ {0, 1, ...}. P is called the transition function and characterizes how states transition: P(s, a, s′) := Pr(St+1 = s′ | St = s, At = a). Rt ∈ [−Rmax, Rmax], a bounded random variable, is the scalar reward received by the agent at time t, where Rmax is a finite constant. R is called the reward function, and is defined as R(s, a) := E[Rt | St = s, At = a]. The reward discount parameter, γ ∈ [0, 1), characterizes how the utility of rewards to the agent decays based on how far in the future they occur. We call d0 the start-state distribution, which is defined as d0(s) := Pr(S0 = s).

We now turn to defining a SAS-MDP. Let the set of actions available at time t be a random variable, 𝒜t ⊆ B, which we assume is never empty, i.e., 𝒜t ≠ ∅. Let ϕ characterize the conditional distribution of 𝒜t: ϕ(s, α) := Pr(𝒜t = α | St = s). We assume that 𝒜t is Markovian, in that its distribution is conditionally independent of all events prior to the agent entering state St, given St. Formally, a SAS-MDP is M′ = {M ∪ ϕ}, with the additional requirement that At ∈ 𝒜t.

A policy π : S × 2^B × B → [0, 1] is a conditional distribution over actions for each state: π(s, α, a) := Pr(At = a | St = s, 𝒜t = α) for all s ∈ S, a ∈ α, α ⊆ B, and t, where α ≠ ∅. Sometimes a policy is parameterized by a weight vector θ, such that changing θ changes the policy. We write πθ to denote such a parameterized policy with weight vector θ. For any policy π, we define the corresponding state-action value function to be

$$q^{\pi}(s, a) := \mathbf{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s, A_t = a, \pi\right],$$

where conditioning on π denotes that A_{t+k} ∼ π(S_{t+k}, 𝒜_{t+k}, ·) for all 𝒜_{t+k} and S_{t+k} for k ∈ [t + 1, ∞). Similarly, the state-value function associated with policy π is

$$v^{\pi}(s) := \mathbf{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s, \pi\right].$$

For a given SAS-MDP M′, the agent's goal is to find an optimal policy, π*, (or equivalently optimal policy parameters θ*), which is any policy that maximizes the expected sum of discounted future rewards. More formally, an optimal policy is any π* ∈ argmax_{π∈Π} J(π), where

$$J(\pi) := \mathbf{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \,\middle|\, \pi\right]$$

and Π denotes the set of all possible policies. For notational convenience, we sometimes use θ in place of π, e.g., to write v^θ, q^θ, or J(θ), since a weight vector θ induces a specific policy.

As shown by Boutilier et al. (2018), one way to model stochastic action sets using the MDP formulation (rather than the SAS-MDP formulation) is to define states such that one can infer 𝒜t from St. Transforming an MDP into a new MDP with 𝒜t embedded in St in this way can result in the size of the state set growing exponentially, by a factor of 2^|B|. This drastic increase in the size of the state set can make finding or approximating an optimal policy prohibitively difficult. Using the SAS-MDP formulation, the challenges associated with this exponential increase in the size of the state set can be avoided, and one can derive algorithms for finding or approximating optimal policies in terms of the state set of the original underlying MDP. This is accomplished using a variant of the Bellman operator, 𝒯, which incorporates the concept of stochastic action sets:

$$\mathcal{T}^{\pi} v(s) = \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi(s, \alpha, a) \left( \sum_{s' \in \mathcal{S}} P(s, a, s') \big( \mathcal{R}(s, a) + \gamma v(s') \big) \right) \tag{1}$$

for all s ∈ S. Similarly, one can extend the Bellman optimality operator (Sutton and Barto 2018):

$$\mathcal{T}^{*} v(s) = \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \max_{a \in \alpha} \sum_{s' \in \mathcal{S}} P(s, a, s') \big( \mathcal{R}(s, a) + \gamma v(s') \big).$$
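When ϕ, P, and R are known (the model-based setting analyzed by Boutilier et al. (2018)), the optimality operator above can be applied directly to a tabular value estimate. The following sketch of SAS value iteration is only illustrative and is not the authors' code; it assumes, for simplicity, that the action-set distribution is given explicitly as a short list of (subset, probability) pairs per state.

import numpy as np

def sas_value_iteration(P, R, phi, gamma=0.95, iters=500):
    """Repeatedly apply the SAS Bellman optimality operator T* to a tabular v.

    P   : (S, A, S) transition probabilities
    R   : (S, A) expected rewards
    phi : phi[s] = list of (available_subset, probability) pairs for state s
    """
    num_states = P.shape[0]
    v = np.zeros(num_states)
    for _ in range(iters):
        new_v = np.zeros(num_states)
        for s in range(num_states):
            backup = R[s] + gamma * (P[s] @ v)   # one-step value of each base action
            # Average the best *available* action over the action-set distribution.
            new_v[s] = sum(p_alpha * backup[list(alpha)].max()
                           for alpha, p_alpha in phi[s])
        v = new_v
    return v

# Tiny 2-state, 2-action example: each action is dropped independently w.p. 0.2,
# conditioned on at least one action remaining available.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0], [0.0, 1.0]])
phi = {s: [((0, 1), 0.64 / 0.96), ((0,), 0.16 / 0.96), ((1,), 0.16 / 0.96)]
       for s in range(2)}
print(sas_value_iteration(P, R, phi))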

Boutilier et al. (2018) showed that stationary optimal policies exist for SAS-MDPs and can be represented using (state-specific) decision lists (or orderings/rankings) over the action set. As a policy takes into account the available set of actions, an optimal policy chooses the highest-ranked action from those that are available. Building upon these results, Boutilier et al. (2018) proposed the following update for a tabular estimate, q̂, of q^{π*}:

$$\hat q(S_t, A_t) \leftarrow (1 - \eta)\, \hat q(S_t, A_t) + \eta \Big( R_t + \gamma \max_{a \in \mathcal{A}_{t+1}} \hat q(S_{t+1}, a) \Big). \tag{2}$$

Notice that the maximum is computed only over the available actions, 𝒜_{t+1}, in state S_{t+1}. We refer to the algorithm using this update rule as SAS-Q-learning.
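A minimal tabular rendering of update (2) follows; the ε-greedy exploration rule and the function names are our own illustrative choices rather than part of the original algorithm description.

import numpy as np

def sas_q_update(q, s, a, r, s_next, avail_next, eta=0.1, gamma=0.99):
    """One SAS-Q-learning update: the max is taken only over the actions
    that are actually available in the next state, as in update (2).
    q is a (num_states, num_actions) array."""
    target = r + gamma * max(q[s_next, a2] for a2 in avail_next)
    q[s, a] += eta * (target - q[s, a])
    return q

def epsilon_greedy(q, s, avail, eps=0.1, rng=np.random.default_rng()):
    """Explore uniformly over the available set, otherwise pick the best available action."""
    if rng.random() < eps:
        return rng.choice(avail)
    return max(avail, key=lambda a: q[s, a])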

Potential Limitations of SAS-Q-Learning

Figure 1: The θ → 2θ MDP.

Although SAS-Q-learning provides a powerful first model-free algorithm for approximating optimal policies for SAS-MDPs, it inherits several of the drawbacks of the Q-learning algorithm for MDPs. Just like Q-learning, in a state St and with available actions 𝒜t, the SAS-Q-learning method chooses actions deterministically when not exploring: At ∈ argmax_{a∈𝒜t} q̂(St, a). This limits its practicality for problems where optimal policies are stochastic, which is often the case when the environment is partially observable or when the use of function approximation causes state aliasing (Baird 1995). Additionally, if the SAS-Q-learning update converges to an estimate, q̂, of q^{π*}, such that 𝒯*v̂(s) = v̂(s) for all s ∈ S, then the agent will act optimally; however, convergence to a fixed point of 𝒯* is seldom achieved in practice, and reducing the difference between v̂(s) and 𝒯*v̂(s) (what SAS-Q-learning aims to do) does not ensure improvement of the policy (Sutton and Barto 2018).

SAS-Q-learning does not perform gradient ascent or descent on any function, and it can cause divergence of the estimator q̂ when using function approximation, just like Q-learning for MDPs (Baird 1995). In the setting where all actions are always available, SAS-Q-learning reduces to standard Q-learning. Therefore, for all the cases in this setting where Q-learning is unstable, SAS-Q-learning is also unstable. In the setting where all actions are not always available, there exist additional cases where Q-learning is stable but SAS-Q-learning is not. However, in such cases where Q-learning is stable, its solution might not be particularly useful, as it does not incorporate the notion of stochasticity in the action set (Section 8, Fig. 2, Boutilier et al. 2018).

To see this, consider the SAS variant of the classical θ → 2θ MDP (Tsitsiklis and Roy 1983) illustrated in Figure 1. In this example there are two states, s1 (left in Figure 1) and s2 (right), and two actions, a1 = left and a2 = right. The agent in this example uses function approximation (Sutton and Barto 2018), with weight vector θ ∈ R², such that q̂(s1, a1) = θ1, q̂(s2, a1) = 2θ1, q̂(s1, a2) = θ2, and q̂(s2, a2) = 2θ2. In either state, if the agent takes the left action, it goes to the left state, and if the agent takes the right action, it goes to the right state. In our SAS-MDP version of this problem, both actions are not always available. Let Rt = 0 always, and γ = 1. Consider the case where the weights of the q̂-approximation are initialized to θ = [−2, −5]. Now suppose that a transition is observed from the left state to the right state, and after the transition the left action is not available to the agent. As per the SAS-Q-learning update rule provided in (2), θ2 ← θ2 + η(r + γ2θ2 − θ2). Since r = 0 and γ = 1, this is equivalent to θ2 ← θ2 + ηθ2. Considering the off-policy setting where this transition is used repeatedly on its own, then irrespective of the learning rate, η > 0, the weight θ2 would diverge to −∞. In contrast, had there been no constraint of taking the max of q̂ over only the available actions, the Q-learning update would have been θ2 ← θ2 + η(r + γ2θ1 − θ2), because action a1 has a higher q̂-value than a2 since θ1 > θ2. This would make θ2 converge to the value −4 (the correct answer is 0).

This provides an example of how the stochastic constraints on the set of available actions can be instrumental in causing the SAS-Q-learning method to diverge, and how ignoring the stochastic constraint can prevent Q-learning from converging to the correct solution. We suspect more such cases can be constructed by adapting examples from the non-SAS setting (Baird 1995; Gordon 1996; Chapter 11.2 of Sutton and Barto 2018).
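The divergence in the counterexample above is easy to reproduce numerically; the short sketch below (ours) simply replays the single transition described in the text (left state to right state, reward 0, γ = 1, left action unavailable in the next state) under both update rules.

# Replaying the transition from the theta -> 2theta example.
# q(s1, a1) = theta[0], q(s1, a2) = theta[1], q(s2, a1) = 2*theta[0], q(s2, a2) = 2*theta[1].
eta, gamma, r = 0.1, 1.0, 0.0

theta = [-2.0, -5.0]
for step in range(200):
    # SAS-Q-learning: a1 is unavailable in s2, so the max is over {a2} only.
    sas_target = r + gamma * 2 * theta[1]
    theta[1] += eta * (sas_target - theta[1])   # theta2 <- theta2 + eta*theta2: diverges
print(theta[1])   # grows without bound in magnitude (towards -infinity)

theta = [-2.0, -5.0]
for step in range(200):
    # Unconstrained Q-learning: the max is over both actions; a1 has the higher q-value.
    q_target = r + gamma * 2 * max(theta)
    theta[1] += eta * (q_target - theta[1])
print(theta[1])   # converges to 2 * theta[0] = -4 (the correct answer is 0)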

Policy Gradient Methods for SAS-MDPs

In this section we provide an alternative to the SAS-Q-learning algorithm by deriving policy gradient algorithms (Sutton et al. 2000) for the SAS-MDP setting. While the Q-learning algorithm minimizes the error between 𝒯v̂(s) and v̂(s) for all states s (using a procedure that is not a gradient algorithm), policy gradient algorithms perform stochastic gradient ascent on the objective function J. That is, they use the update θ ← θ + ηΔ, where Δ is an unbiased estimator of ∇J(θ).

Unlike the Q-learning algorithm, policy gradient algorithms for MDPs provide convergence guarantees to a critical point (local/global optima) even when using function approximation, and can approximate optimal stochastic policies. However, ignoring the fact that actions are not always available and using off-the-shelf algorithms for MDPs fails to fully capture the problem setting (Boutilier et al. 2018). It is therefore important that we derive policy gradient algorithms that are appropriate for the SAS-MDP setting, as they provide the first convergent model-free algorithms for SAS-MDPs when using function approximation. In the following lemma we extend the expression for the policy gradient for MDPs (Sutton et al. 2000; Thomas 2014) to handle stochastic action sets.

Lemma 1 (SAS Policy Gradient). For a SAS-MDP, for all s ∈ S,

$$\nabla J(\theta) = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^{t} \Pr(S_t = s|\theta) \left( \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} q^{\theta}(s, a) \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta} \right).$$

Proof. See Appendix A.

It follows from Lemma 1 that we can create unbiased estimates of ∇J(θ), which can be used to update θ using the well-known stochastic gradient ascent algorithm. This algorithm is presented in Algorithm 1. Notably, this process does not require the agent to know ϕ. Also, similar to the SAS-Q-learning method, the policy can be parameterized such that it is not required to embed the available actions as a part of the state. One such parameterization is provided in Appendix F. Notice that in the special case where all actions are always available, the expression in Lemma 1 degenerates to the policy gradient theorem for MDPs (Sutton and Barto 2018). We now establish that SAS policy gradient algorithms are guaranteed to converge to locally optimal policies under the following standard assumptions.
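As an illustration of the resulting stochastic gradient ascent procedure, the sketch below performs a REINFORCE-style update from one observed episode, using Monte-Carlo returns in place of qθ and the masked-softmax parameterization assumed in the earlier sketch; no baseline is used yet, and the helper names are ours, not the authors'.

import numpy as np

def grad_log_masked_softmax(theta, phi_s, available, a):
    """psi_theta(s, alpha, a) = d log pi_theta(s, alpha, a) / d theta
    for a softmax over linear preferences restricted to the available set."""
    prefs = phi_s @ theta
    masked = prefs[available] - prefs[available].max()
    probs = np.exp(masked) / np.exp(masked).sum()
    grad = np.zeros_like(theta)
    for p, b in zip(probs, available):
        grad[:, b] -= p * phi_s          # -pi(b | s, alpha) * phi(s) for b in alpha
    grad[:, a] += phi_s                  # +phi(s) for the chosen action
    return grad

def sas_pg_episode_update(theta, episode, eta=0.01, gamma=0.99):
    """REINFORCE-style SAS policy gradient update from one episode of
    (state_features, available_set, action, reward) tuples."""
    G = 0.0
    update = np.zeros_like(theta)
    for phi_s, alpha, a, r in reversed(episode):
        G = r + gamma * G                # Monte-Carlo return from this step
        update += grad_log_masked_softmax(theta, phi_s, alpha, a) * G
    return theta + eta * update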


Assumption A1 (Differentiable). For any state, action-set, and action triplet (s, α, a), the policy πθ(s, α, a) is continuously differentiable in the parameter θ.

Assumption A2 (Lipschitz smooth gradient). Let Θ denote the set of all possible parameters for the policy πθ; then for some constant L,

$$\|\nabla J(\theta) - \nabla J(\theta')\| \leq L \|\theta - \theta'\| \qquad \forall \theta, \theta' \in \Theta.$$

Assumption A3 (Learning rate schedule). Let $\eta_t^{\theta}$ be the learning rate for updating the policy parameters θ; then

$$\sum_{t=0}^{\infty} \eta_t^{\theta} = \infty, \qquad \sum_{t=0}^{\infty} \big(\eta_t^{\theta}\big)^2 < \infty.$$

All the assumptions (A1)-(A3) are satisfied under standard policy parameterization techniques (linear functions or neural networks with a softmax output) and appropriately set learning rates.

Lemma 2. Under Assumptions (A1)-(A3), the SAS policy gradient algorithm causes ∇J(θt) → 0 as t → ∞, with probability one.

Proof. See Appendix B.

Natural policy gradient algorithms (Kakade 2002) extend policy gradient algorithms to follow the natural gradient of J (Amari 1998). In essence, whereas policy gradient methods perform gradient ascent in the space of policy parameters by computing the gradient of J as a function of the parameters θ, natural policy gradient methods perform gradient ascent in the space of policies (which are probability distributions) by computing the gradient of J as a function of the policy, π. Thus, whereas policy gradient methods implicitly measure distances between policies by the Euclidean distance between their policy parameters, natural policy gradient methods measure distances between policies using notions of distance between probability distributions. In the most common form of natural policy gradients, the distances between policies are measured using a Taylor approximation of the Kullback-Leibler divergence (KLD). By performing gradient ascent in the space of policies rather than the space of policy parameters, the natural policy gradient becomes invariant to how the policy is parameterized (Thomas, Dann, and Brunskill 2018), which can help to mitigate the vanishing gradient problem in neural networks and improve learning speed (Amari 1998).

The natural policy gradient (using a Taylor approximation of KLD to measure distances) is $\widetilde{\nabla} J(\theta) := F_{\theta}^{-1} \nabla J(\theta)$, where $F_{\theta}$ is the Fisher information matrix (FIM) associated with the policy πθ. Although the FIM is a well-known quantity, it is typically associated with a parameterized probability distribution. Here, πθ is a collection of probability distributions, one per state. This raises the question of what Fθ should be when computing the natural policy gradient. Following the work of Bagnell and Schneider (2003) for MDPs, we show that the FIM, Fθ, for computing the natural policy gradient for a SAS-MDP can also be derived by viewing πθ as a distribution over possible trajectories (sequences of states, available action sets, and executed actions).

Property 1 (Fisher Information Matrix). For a policy parameterized using weights θ, let ψθ(s, α, a) := ∂ log πθ(s, α, a)/∂θ; then the Fisher information matrix is

$$F_{\theta} = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^{t} \Pr(S_t = s|\theta) \sum_{\alpha \in 2^{\mathcal{B}}} \left( \varphi(s, \alpha) \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \psi_{\theta}(s, \alpha, a)\, \psi_{\theta}(s, \alpha, a)^{\top} \right).$$

Proof. See Appendix C.

Furthermore, Kakade (2002) showed that many terms in the definition of the natural policy gradient cancel, providing a simple expression for the natural gradient which can be estimated with time linear in the number of policy parameters per time step. We extend the result of Kakade (2002) to the SAS-MDP formulation in the following lemma:

Lemma 3 (SAS Natural Policy Gradient). Let w be a parameter such that

$$\frac{\partial}{\partial w} \mathbf{E}\left[\frac{1}{2} \sum_{t=0}^{\infty} \gamma^{t} \Big( \psi_{\theta}(S_t, \mathcal{A}_t, A_t)^{\top} w - q^{\theta}(S_t, A_t) \Big)^{2} \right] = 0;$$

then for all s ∈ S in M′, $\widetilde{\nabla} J(\theta) = w$.

Proof. See Appendix C.

From Lemma 3, we can derive a computationally efficient natural policy gradient algorithm by using the well-known temporal difference algorithm (Sutton and Barto 2018), modified to work with SAS-MDPs, to estimate qθ with the approximator ψθ(St, 𝒜t, At)⊤w, and then using the update θ ← θ + ηw. This algorithm, which is the SAS-MDP equivalent of NAC-TD (Bhatnagar et al. 2008; Degris, Pilarski, and Sutton 2012; Morimura, Uchibe, and Doya 2005; Thomas and Barto 2012), is provided in Algorithm 2 in Appendix E.
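A schematic rendering of how Lemma 3 can be used in practice: fit w so that ψθ⊤w tracks qθ (here via batch least squares on Monte-Carlo returns rather than the incremental TD procedure of Algorithm 2, which is deferred to Appendix E), then step θ along w. The function names, the normalization, and the step sizes below are our own placeholders, not the authors' implementation.

import numpy as np

def sas_natural_pg_update(theta, episode, psi_fn, eta_theta=0.05, gamma=0.99):
    """Schematic SAS natural policy gradient step.

    episode : list of (state_features, available_set, action, reward)
    psi_fn  : psi_fn(theta, phi_s, alpha, a) -> score vector, same shape as theta
              (e.g., grad_log_masked_softmax from the earlier sketch)
    """
    # 1. Monte-Carlo returns as regression targets standing in for q_theta(s, a).
    returns, G = [], 0.0
    for _, _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns = returns[::-1]

    # 2. Fit w so that psi^T w approximates q_theta on this batch;
    #    by Lemma 3, this w points along the natural gradient.
    Psi = np.stack([psi_fn(theta, phi_s, alpha, a).ravel()
                    for phi_s, alpha, a, _ in episode])
    w, *_ = np.linalg.lstsq(Psi, np.asarray(returns), rcond=None)

    # 3. Step the policy parameters along w (normalized for stability).
    w = w.reshape(theta.shape)
    return theta + eta_theta * w / (np.linalg.norm(w) + 1e-8)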

Adaptive Variance Mitigation

In the previous section, we derived (natural) policy gradient algorithms for SAS-MDPs. While these algorithms avoid the divergence of SAS-Q-learning, they suffer from the high variance of policy gradient estimates (Kakade and others 2003). As a consequence of the additional stochasticity that results from stochastic action sets, this problem can be even more severe in the SAS-MDP setting. In this section, we leverage insights from the Bellman equation for SAS-MDPs, provided in (1), to reduce the variance of policy gradient estimates.

One of the most popular methods to reduce variance is the use of a state-dependent baseline b(s). Sutton et al. (2000) showed that, for any state-dependent baseline b(s),

$$\nabla J(\theta) = \mathbf{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \psi_{\theta}(s, \alpha, a) \Big( q^{\theta}(s, a) - b(s) \Big) \right]. \tag{3}$$

For any random variables X and Y, the variance of X − Y is given by var(X − Y) = var(X) + var(Y) − 2 cov(X, Y), where cov denotes covariance. Therefore, the variance of X − Y is less than the variance of X if 2 cov(X, Y) > var(Y). As a result, any state-dependent baseline b(s) whose value is sufficiently correlated with the expected return, qθ(s, a), can be used to reduce the variance of the sample estimator of (3). A baseline dependent on both the state and the action can have higher correlation with qθ(s, a), and could therefore reduce variance further. However, such action-dependent baselines cannot be used directly, as they can result in biased gradient estimates. Developing such baselines remains an active area of research for MDPs (Thomas and Brunskill 2017; Grathwohl et al. 2017; Liu et al. 2017; Wu et al. 2018; Tucker et al. 2018) and is largely complementary to our purpose. Further, even the optimal state-dependent baseline (Greensmith, Bartlett, and Baxter 2004), which leads to the minimum-variance gradient estimator, is not feasible to compute, and only under certain restrictive assumptions does it reduce to the common choice of a state-value function estimator, v̂(s). Therefore, in the following, we propose multiple baselines that are easy to compute, and then combine them optimally.

We now introduce a baseline for SAS-MDPs that lies between state-dependent and state-action-dependent baselines. Like state-dependent baselines, these new baselines do not introduce bias into gradient estimates. However, like action-dependent baselines, these new baselines include some information about the chosen actions. Specifically, we propose baselines that depend on the state, St, and the available action set, 𝒜t, but not the precise action, At.

Recall from the SAS Bellman equation (1) that the state-value function for SAS-MDPs can be written as

$$v^{\theta}(s) = \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, q^{\theta}(s, a).$$

While we cannot directly use a baseline dependent on the action sampled from πθ, we can use a baseline dependent on the sampled action set. We consider a new baseline which leverages this information about the sampled action set α. This baseline is

$$\bar q(s, \alpha) := \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \hat q(s, a),$$

where q̂ is a learned estimator of the state-action value function, and q̄ represents its expected value under the current policy, πθ, conditioned on the sampled action set α.

In principle, we expect q̄(St, 𝒜t) to be more correlated with qθ(St, At), as it explicitly conditions on the action set and does not compute an average over all possible action sets, like v̂ does. Practically, however, estimating q̂ values can be harder than estimating v̂. This can be attributed to the fact that, with the same number of training samples, the number of parameters to learn in q̂ is larger than in an estimate of vθ. This poses a new dilemma of deciding when to use which baseline. To get the best of both, we consider using a weighted combination of v̂(St) and q̄(St, 𝒜t). In the following property we establish that using any weighted combination of these two baselines results in an unbiased estimate of the SAS policy gradient.
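Computing the new baseline is a one-liner once q̂ and the policy probabilities over the sampled action set are in hand; a minimal sketch (the dict-based interface is our own choice):

def action_set_baseline(q_hat, pi_probs):
    """q_bar(s, alpha) = sum_{a in alpha} pi_theta(s, alpha, a) * q_hat(s, a).

    q_hat    : mapping action index -> estimated q_hat(s, a)
    pi_probs : mapping action index (only actions in alpha) -> pi_theta(s, alpha, a)
    """
    return sum(p * q_hat[a] for a, p in pi_probs.items())

# Example: three actions have value estimates, but only actions 0 and 2 were available.
q_hat = {0: 1.0, 1: 5.0, 2: 3.0}
pi_probs = {0: 0.25, 2: 0.75}                  # e.g., output of the masked softmax policy
print(action_set_baseline(q_hat, pi_probs))    # 0.25*1.0 + 0.75*3.0 = 2.5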

Property 2 (Unbiased estimator). Let $\hat J(s, \alpha, a, \theta) := \psi_{\theta}(s, \alpha, a)\big(q^{\theta}(s, a) + \lambda_1 \hat v(s) + \lambda_2 \bar q(s, \alpha)\big)$ and $d^{\pi}(s) := (1 - \gamma)\sum_{t=0}^{\infty} \gamma^{t} \Pr(S_t = s)$; then for any values of λ1 ∈ R and λ2 ∈ R,

$$\nabla J(\theta) = \mathbf{E}\Big[\hat J(s, \alpha, a, \theta) \,\Big|\, d^{\pi}, \varphi, \pi\Big].$$

Proof. See Appendix D.

Algorithm 1: Stochastic Action Set Policy Gradient (SAS-PG)

1:  A = [λ1, λ2]⊤ = [−0.5, −0.5]⊤                                  ▷ Initialize λ's
2:  for episode = 0, 1, 2, ... do
        # Collect a transition batch using πθ
3:      B = {(s0, α0, a0, r0), ..., (sT, αT, aT, rT)}
4:      G(st) = Σ_{k=0}^{T−t} γ^k r_{t+k}
        # Perform updates on the parameters using batch B
5:      ψθ(s, α, a) = ∂ log πθ(s, α, a)/∂θ
6:      ϖ ← ϖ + ηϖ (G(s) − v̂ϖ(s)) ∂v̂ϖ(s)/∂ϖ
7:      ω ← ω + ηω (G(s) − q̄ω(s, α)) ∂q̄ω(s, α)/∂ω
8:      θ ← θ + ηθ (G(s) + λ1 v̂ϖ(s) + λ2 q̄ω(s, α)) ψθ(s, α, a)    ▷ Update πθ
        # Automatically tune the hyper-parameters for variance reduction using B
9:      B = [ψθ(s, α, a) v̂ϖ(s), ψθ(s, α, a) q̄ω(s, α)]
10:     C = ψθ(s, α, a) G(s)
11:     Â ← −(E[B⊤B])⁻¹ E[B⊤C]
12:     A ← ηλ Â + (1 − ηλ) A                                      ▷ Update λ's

The question remains: what values should be used for λ1 and λ2 when combining v̂ and q̄? Similar problems of combining different estimators have been studied in the statistics literature (Graybill and Deal 1959; Meir and others 1994) and more recently for combining control variates (Wang et al. 2013; Geffner and Domke 2018). Building upon their ideas, rather than leaving λ1 and λ2 as open hyperparameters, we propose a method for automatically adapting A = [λ1, λ2] to the specific SAS-MDP and the current policy parameters, θ. The following lemma presents an analytic expression for the value of A that minimizes a sample-based estimate of the variance of Ĵ.

Lemma 4 (Adaptive variance mitigation). If A = [λ1, λ2]⊤, B = [ψθ(s, α, a) v̂(s), ψθ(s, α, a) q̄(s, α)], and C = ψθ(s, α, a) qθ(s, a), where A ∈ R^{2×1}, B ∈ R^{d×2}, and C ∈ R^{d×1}, then the A that minimizes the variance of Ĵ is given by

$$A = -\big(\mathbf{E}[B^{\top} B]\big)^{-1} \mathbf{E}[B^{\top} C]. \tag{4}$$

Proof. See Appendix D.

Lemma 4 provides the values for λ1 and λ2 that result in the minimal variance of Ĵ. Note that the computational cost associated with evaluating the inverse of E[B⊤B] is negligible because its dimension is always 2 × 2, independent of the number of policy parameters. Also, Lemma 4 provides the optimal values of λ1 and λ2, which still must be approximated using sample-based estimates of B and C. Furthermore, one might use double sampling for B to get unbiased estimates of the variance-minimizing value of A (Baird 1995). However, as Property 2 ensures that estimates of Ĵ for any value of λ1 and λ2 are always unbiased, we opt to use all the available samples for estimating E[B⊤B] and E[B⊤C].

Figure 2: (Top) Best performing learning curves on the domains considered. The probability of any action being available in the action set is 0.8. (Bottom) Autonomously adapted values of λ1 and λ2 associated with v̂ and q̄, respectively, for the SAS-PG results. Shaded regions correspond to one standard deviation obtained using 30 trials.
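A sample-based sketch of the closed form in (4): stack the per-sample vectors ψ v̂ and ψ q̄ into B and ψ G into C, average, and solve the resulting 2 × 2 system (the small ridge term stands in for the diagonal noise mentioned in the Algorithm section below). The array layout and function name are our own choices.

import numpy as np

def adapt_lambdas(psis, v_hats, q_bars, returns, ridge=1e-6):
    """Estimate A = [lambda1, lambda2] = -(E[B^T B])^{-1} E[B^T C] from a batch.

    psis    : (n, d) array of score vectors psi_theta(s, alpha, a)
    v_hats  : (n,) state-value baseline evaluations v_hat(s)
    q_bars  : (n,) action-set baseline evaluations q_bar(s, alpha)
    returns : (n,) sampled returns standing in for q_theta(s, a)
    """
    B = np.stack([psis * v_hats[:, None], psis * q_bars[:, None]], axis=2)  # (n, d, 2)
    C = psis * returns[:, None]                                             # (n, d)
    BtB = np.einsum('ndi,ndj->ij', B, B) / len(psis)   # sample estimate of E[B^T B]
    BtC = np.einsum('ndi,nd->i', B, C) / len(psis)     # sample estimate of E[B^T C]
    return -np.linalg.solve(BtB + ridge * np.eye(2), BtC)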

Algorithm

Pseudo-code for the SAS policy gradient algorithm is provided in Algorithm 1. Let the estimators of vθ and qθ be v̂ϖ and q̂ω, which are parameterized using ϖ and ω, respectively. Let πθ correspond to the policy parameterized using θ. Let ηϖ, ηω, ηθ, and ηλ be the learning-rate hyper-parameters. We begin by initializing the λ values to −0.5 each, so that the update subtracts the average of the two baselines from the sampled return. In Lines 3 and 4, we execute πθ to observe a trajectory and compute the returns. Lines 6 and 7 correspond to the updates for the parameters associated with v̂ϖ and q̄ω, using their corresponding TD errors (Sutton and Barto 2018). The policy parameters are then updated using a combination of both baselines. We drop the γ^t dependency for data efficiency (Thomas 2014). As per Lemma 4, for automatically tuning the values of λ1 and λ2, we create sample estimates of the matrices B and C using the transitions from batch B, in Lines 9 and 10. To update the values of the λ's, we compute Â using the sample estimates of E[B⊤B] and E[B⊤C]. While computing the inverse, a small amount of diagonal noise is added to ensure that the inverse exists. As everything is parameterized using smooth functions, subsequent estimates of Â should not vary much. Since we only have access to a sample estimate of Â, we leverage Polyak-Ruppert averaging in Line 12 for stability. Due to space constraints, the algorithm for SAS natural policy gradient is deferred to Appendix E.
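Putting the pieces together, here is a compact, schematic rendering of one iteration of Algorithm 1 with linear critics; the variable names loosely mirror the pseudo-code (ϖ → w_v, ω → W_q), but the feature map, policy interface, and step sizes are placeholders rather than the authors' implementation.

import numpy as np

def sas_pg_iteration(theta, w_v, W_q, lam, batch, policy, grad_log_pi, feats,
                     etas=(0.05, 0.1, 0.1, 0.1)):
    """One schematic iteration of SAS-PG (Algorithm 1) with linear critics.

    batch       : list of (s, alpha, a, G) with Monte-Carlo return G from (s, a)
    policy      : policy(theta, x, alpha) -> {a: pi(a | s, alpha)}
    grad_log_pi : grad_log_pi(theta, x, alpha, a) -> array shaped like theta
    feats       : feats(s) -> state feature vector x
    w_v         : weights of the linear baseline, v_hat(s) = x @ w_v
    W_q         : per-action weights, q_hat(s, a) = x @ W_q[:, a]
    """
    eta_theta, eta_v, eta_q, eta_lam = etas
    psis, vs, qbars, Gs = [], [], [], []
    for s, alpha, a, G in batch:
        x = feats(s)
        probs = policy(theta, x, alpha)
        v_hat = x @ w_v
        q_bar = sum(p * (x @ W_q[:, b]) for b, p in probs.items())
        # Critic updates (Lines 6-7): move v_hat and q_bar towards the return.
        w_v += eta_v * (G - v_hat) * x
        for b, p in probs.items():
            W_q[:, b] += eta_q * (G - q_bar) * p * x
        # Actor update (Line 8), with the combined baseline.
        psi = grad_log_pi(theta, x, alpha, a)
        theta += eta_theta * (G + lam[0] * v_hat + lam[1] * q_bar) * psi
        psis.append(psi.ravel()); vs.append(v_hat); qbars.append(q_bar); Gs.append(G)
    # Adapt the lambdas (Lines 9-12): closed form of Lemma 4, then Polyak averaging.
    psis, vs, qbars, Gs = map(np.asarray, (psis, vs, qbars, Gs))
    B = np.stack([psis * vs[:, None], psis * qbars[:, None]], axis=2)
    C = psis * Gs[:, None]
    A_hat = -np.linalg.solve(np.einsum('ndi,ndj->ij', B, B) / len(Gs) + 1e-6 * np.eye(2),
                             np.einsum('ndi,nd->i', B, C) / len(Gs))
    lam = eta_lam * A_hat + (1 - eta_lam) * np.asarray(lam)
    return theta, w_v, W_q, lam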

Empirical Analysis

In this section we use empirical studies to answer the following three questions: (a) How do our proposed algorithms, SAS policy gradient (SAS-PG) and SAS natural policy gradient (SAS-NPG), compare to the prior method, SAS-Q-learning? (b) How does our adaptive variance reduction technique weight the two baselines over the training duration? (c) What impact does the probability of action availability have on the performance of SAS-PG, SAS-NPG, and SAS-Q-learning? To evaluate these aspects, we first briefly introduce three domains inspired by real-world problems.

Routing in San Francisco. This task models the problem of finding shortest paths in San Francisco, and was first presented with stochastic actions by Boutilier et al. (2018). Stochastic actions model the concept that certain paths in the road network may not be available at certain times. A positive reward is provided to the agent when it reaches the destination, while a small penalty is applied at every time step. We modify the domain presented by Boutilier et al. (2018) so that the starting state of the agent is not one particular node, but rather is chosen uniformly at random among all possible locations. This makes the problem more challenging, since it requires the agent to learn the shortest path from every node. All the states (nodes) are discrete, and edges correspond to the action choices. Each edge is made available with some fixed probability. The overall map is shown in the appendix.

Robot locomotion task in a maze. In this domain, the agent has to navigate a maze using unreliable actuators. The agent starts at the bottom left corner, and a goal reward is given when it reaches the goal position, marked by a star (see the appendix for the figure). The agent is penalized at each time step to encourage it to reach the goal as quickly as possible. The state space is continuous, and corresponds to the real-valued Cartesian coordinates of the agent's position. The agent has 16 actuators pointing in different directions. Turning an actuator on moves the agent in the direction of that actuator. However, each actuator is unreliable, and is therefore only available with some fixed probability.

Figure 3: Best performances of different algorithms across different values of the probability of action availability. The error bars correspond to one standard deviation obtained using 30 trials.

Product recommender system. In online marketing and sales, product recommendation is a popular problem. Due to various factors such as stock outages, promotions, delivery issues, etc., not all products can be recommended at all times. To model this, we consider a synthetic setup of providing recommendations to a user from a batch of 100 products, each available with some fixed probability and associated with a stochastic reward corresponding to profit. Each user has a real-valued context, which forms the state space, and the recommender system interacts with a randomly chosen user for 5 steps. The goal for the recommender system is to suggest products that maximize the total profit. Often the problem of recommendation is formulated as a contextual bandit or collaborative filtering problem, but as shown by Theocharous, Thomas, and Ghavamzadeh (2015), these approaches fail to capture the long-term value of the prediction. Hence we resort to the full RL setup.

Results

Here we only discuss representative results for the three major questions of interest. Plots for detailed evaluations are available in Appendix F.

(a) For the routing problem in San Francisco, as both the states and actions are discrete, the q-function for each state-action pair has a unique parameter. When no parameters are shared, SAS-Q-learning will not diverge. Therefore, in this domain, we notice that SAS-Q-learning performs similarly to the proposed algorithms. However, in many large-scale problems, the use of function approximators is crucial for estimating the optimal policy. For the robot locomotion task in the maze domain and the recommender system, the state space is not discrete and hence function approximators are required to obtain the state features. As we saw in the section on the potential limitations of SAS-Q-learning, the sharing of state features can create problems for SAS-Q-learning. The increased variance in the performance of SAS-Q-learning is visible in both the Maze and the Recommender system domains in Figure 2. While SAS-Q eventually performs the same on the Maze domain, its performance improvement saturates quickly in the recommender system domain, resulting in a sub-optimal policy.

(b) To provide visual intuition for the behavior of adaptive variance mitigation, we report the values of λ1 and λ2 over the training duration in Figure 2. As several factors are combined through (4) to influence the λ values, it is hard to pinpoint any individual factor that is responsible for the observed trend. However, note that for both the routing problem in San Francisco and the robot navigation in the maze, the goal reward is obtained on reaching the destination, and intermediate actions do not impact the total return significantly. Intuitively, this makes the action-set-conditioned baseline q̄ similarly correlated with the observed return as the state-only-conditioned baseline, v̂, but at the expense of estimating significantly more parameters. Thus the importance of q̄ is automatically adapted to be closer to zero. On the other hand, in the recommender system, each product has a significant amount of associated reward. Therefore, the total return possible during each episode has a strong dependency on the available action set, and thus the magnitude of the weight for q̄ is much larger than that for v̂.

(c) To understand the impact of the probability of an action being available, we report the best performances of all the algorithms for different probability values in Figure 3. We notice that in the San Francisco routing domain, SAS-Q-learning has a slight edge over the proposed methods. This can be attributed to the fact that off-policy samples can be re-used without causing any divergence problems, as state features are not shared. For the maze and the recommender system tasks, where function approximators are necessary, the proposed methods significantly outperform SAS-Q.

Conclusion

Building upon the SAS-MDP framework of Boutilier et al. (2018), we studied an under-addressed problem of dealing with MDPs with stochastic action sets. We highlighted some of the limitations of the existing method and addressed them by generalizing policy gradient methods for SAS-MDPs. Additionally, we introduced a novel baseline and an adaptive variance reduction technique unique to this setting. Our approach has several benefits. Not only does it generalize the theoretical properties of standard policy gradient methods, but it is also practically efficient and simple to implement.


Acknowledgement

The research was supported by and partially conducted at Adobe Research. We are also immensely grateful to the three anonymous reviewers who shared their insights and feedback, especially the second reviewer, who helped improve the counterexample.

References

Amari, S.-i., and Nagaoka, H. 2007. Methods of Information Geometry, volume 191. American Mathematical Soc.
Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2):251–276.
Bagnell, J. A., and Schneider, J. G. 2003. Covariant policy search. In IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.
Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier. 30–37.
Bertsekas, D. P., and Tsitsiklis, J. N. 2000. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization 10(3):627–642.
Bhatnagar, S.; Ghavamzadeh, M.; Lee, M.; and Sutton, R. S. 2008. Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems, 105–112.
Boutilier, C.; Cohen, A.; Daniely, A.; Hassidim, A.; Mansour, Y.; Meshi, O.; Mladenov, M.; and Schuurmans, D. 2018. Planning and learning with stochastic action sets. In IJCAI.
Degris, T.; Pilarski, P. M.; and Sutton, R. S. 2012. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference.
Feng, Y., and Yan, H. 2000. Optimal production control in a discrete manufacturing system with unreliable machines and random demands. IEEE Transactions on Automatic Control.
Geffner, T., and Domke, J. 2018. Using large ensembles of control variates for variational inference. In Advances in Neural Information Processing Systems.
Gendreau, M.; Laporte, G.; and Seguin, R. 1996. Stochastic vehicle routing. European Journal of Operational Research.
Gordon, G. J. 1996. Chattering in SARSA(lambda). A CMU Learning Lab Internal Report.
Grathwohl, W.; Choi, D.; Wu, Y.; Roeder, G.; and Duvenaud, D. 2017. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123.
Graybill, F. A., and Deal, R. 1959. Combining unbiased estimators. Biometrics 15(4):543–550.
Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5(Nov):1471–1530.
Harper, G. W., and Skiba, S. 2007. User-personalized media sampling, recommendation and purchasing system using real-time inventory database. US Patent 7,174,312.
Kakade, S. M., et al. 2003. On the sample complexity of reinforcement learning. Ph.D. Dissertation, University of London, London, England.
Kakade, S. M. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, 1531–1538.
Kanade, V.; McMahan, H. B.; and Bryan, B. 2009. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS.
Kleinberg, R.; Niculescu-Mizil, A.; and Sharma, Y. 2010. Regret bounds for sleeping experts and bandits. Machine Learning.
Konidaris, G.; Osentoski, S.; and Thomas, P. 2011. Value function approximation in reinforcement learning using the Fourier basis. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Liu, H.; Feng, Y.; Mao, Y.; Zhou, D.; Peng, J.; and Liu, Q. 2017. Action-dependent control variates for policy optimization via Stein's identity. arXiv preprint arXiv:1710.11198.
Mahdian, M.; Nazerzadeh, H.; and Saberi, A. 2007. Allocating online advertisement space with unreliable estimates. In Proceedings of the 8th ACM Conference on Electronic Commerce. ACM.
Meir, R., et al. 1994. Bias, variance and the combination of estimators: The case of linear least squares. Citeseer.
Morimura, T.; Uchibe, E.; and Doya, K. 2005. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Application, 256–263.
Nikolova, E., and Karger, D. R. 2008. Route planning under uncertainty: The Canadian traveller problem. In AAAI.
Nikolova, E.; Brand, M.; and Karger, D. R. 2006. Optimal route planning under uncertainty. In ICAPS, volume 6, 131–141.
Papadimitriou, C. H., and Yannakakis, M. 1991. Shortest paths without a map. Theoretical Computer Science 84(1):127–150.
Polychronopoulos, G. H., and Tsitsiklis, J. N. 1996. Stochastic shortest path problems with recourse. Networks: An International Journal 27(2):133–143.
Ribeiro, A.; Sidiropoulos, N. D.; and Giannakis, G. B. 2008. Optimal distributed stochastic routing algorithms for wireless multihop networks. IEEE Transactions on Wireless Communications.
Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
Tan, B., and Srikant, R. 2012. Online advertisement, optimization and stochastic networks. IEEE Transactions on Automatic Control.
Theocharous, G.; Thomas, P. S.; and Ghavamzadeh, M. 2015. Ad recommendation systems for life-time value optimization. In Proceedings of the 24th International Conference on World Wide Web, 1305–1310. ACM.
Thomas, P. S., and Barto, A. G. 2012. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, 1–8.
Thomas, P. S., and Brunskill, E. 2017. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. arXiv preprint arXiv:1706.06643.
Thomas, P.; Dann, C.; and Brunskill, E. 2018. Decoupling gradient-like learning rules from representations. In International Conference on Machine Learning.
Thomas, P. 2014. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, 441–448.
Tsitsiklis, J., and Roy, B. 1983. An analysis of temporal-difference with function approximation. IEEE Trans. Autom. Control 42(5):834–836.
Tucker, G.; Bhupatiraju, S.; Gu, S.; Turner, R. E.; Ghahramani, Z.; and Levine, S. 2018. The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031.
Wang, C.; Chen, X.; Smola, A. J.; and Xing, E. P. 2013. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems.
Wu, C.; Rajeswaran, A.; Duan, Y.; Kumar, V.; Bayen, A. M.; Kakade, S.; Mordatch, I.; and Abbeel, P. 2018. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246.


Reinforcement Learning When All Actions are Not Always Available (Supplementary Material)

A: SAS Policy Gradient

Lemma 1 (SAS Policy Gradient). For all s ∈ S,

$$\frac{d}{d\theta} J(\theta) = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^{t} \Pr(S_t = s|\theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} q^{\theta}(s, a) \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}.$$

Proof.

$$\begin{aligned}
\frac{\partial v^{\theta}(s)}{\partial \theta} &= \frac{\partial}{\partial \theta} \mathbf{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s, \theta\right] \\
&= \frac{\partial}{\partial \theta} \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \Pr(A_t = a \,|\, S_t = s, \mathcal{A}_t = \alpha, \theta)\, \mathbf{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s, A_t = a, \theta\right] \\
&= \frac{\partial}{\partial \theta} \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, q^{\theta}(s, a) \\
&= \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \left( \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}\, q^{\theta}(s, a) + \pi^{\theta}(s, \alpha, a)\, \frac{\partial q^{\theta}(s, a)}{\partial \theta} \right) \\
&= \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}\, q^{\theta}(s, a) + \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \frac{\partial}{\partial \theta} \sum_{s' \in \mathcal{S}} P(s, a, s') \big( \mathcal{R}(s, a) + \gamma v^{\theta}(s') \big) \qquad (5) \\
&= \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}\, q^{\theta}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \Pr(S_{t+1} = s' \,|\, S_t = s, \theta)\, \frac{\partial v^{\theta}(s')}{\partial \theta},
\end{aligned}$$

where (5) comes from unrolling the Bellman equation. We started with the partial derivative of the value of a state, expanded the definition of the value of a state, and obtained an expression in terms of the partial derivative of the value of another state. Now, we again expand ∂vθ(s′)/∂θ using the definition of the state-value function and the Bellman equation:

$$\begin{aligned}
\frac{\partial v^{\theta}(s)}{\partial \theta} &= \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}\, q^{\theta}(s, a) \\
&\quad + \gamma \sum_{s' \in \mathcal{S}} \Pr(S_{t+1} = s' \,|\, S_t = s, \theta)\, \frac{\partial}{\partial \theta} \left( \sum_{\alpha' \in 2^{\mathcal{B}}} \varphi(s', \alpha') \sum_{a' \in \alpha'} \pi^{\theta}(s', \alpha', a')\, q^{\theta}(s', a') \right) \\
&= \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}\, q^{\theta}(s, a) \\
&\quad + \gamma \sum_{s' \in \mathcal{S}} \Pr(S_{t+1} = s' \,|\, S_t = s, \theta) \sum_{\alpha' \in 2^{\mathcal{B}}} \varphi(s', \alpha') \left( \sum_{a' \in \alpha'} \frac{\partial \pi^{\theta}(s', \alpha', a')}{\partial \theta}\, q^{\theta}(s', a') + \pi^{\theta}(s', \alpha', a')\, \frac{\partial q^{\theta}(s', a')}{\partial \theta} \right) \\
&= \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}\, q^{\theta}(s, a) \\
&\quad + \gamma \sum_{s' \in \mathcal{S}} \Pr(S_{t+1} = s' \,|\, S_t = s, \theta) \sum_{\alpha' \in 2^{\mathcal{B}}} \varphi(s', \alpha') \sum_{a' \in \alpha'} \frac{\partial \pi^{\theta}(s', \alpha', a')}{\partial \theta}\, q^{\theta}(s', a') \\
&\quad + \gamma \sum_{s' \in \mathcal{S}} \Pr(S_{t+1} = s' \,|\, S_t = s, \theta) \sum_{\alpha' \in 2^{\mathcal{B}}} \varphi(s', \alpha') \sum_{a' \in \alpha'} \pi^{\theta}(s', \alpha', a')\, \frac{\partial}{\partial \theta} \left( \sum_{s'' \in \mathcal{S}} P(s', a', s'') \big( \mathcal{R}(s', a') + \gamma v^{\theta}(s'') \big) \right) \\
&= \underbrace{\sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}\, q^{\theta}(s, a)}_{\text{first term}} + \underbrace{\gamma \sum_{s' \in \mathcal{S}} \Pr(S_{t+1} = s' \,|\, S_t = s, \theta) \sum_{\alpha' \in 2^{\mathcal{B}}} \varphi(s', \alpha') \sum_{a' \in \alpha'} \frac{\partial \pi^{\theta}(s', \alpha', a')}{\partial \theta}\, q^{\theta}(s', a')}_{\text{second term}} \\
&\quad + \gamma^{2} \sum_{s'' \in \mathcal{S}} \Pr(S_{t+2} = s'' \,|\, S_t = s, \theta)\, \frac{\partial v^{\theta}(s'')}{\partial \theta}. \qquad (6)
\end{aligned}$$

Expanding ∂vθ(s′)/∂θ allowed us to write it in terms of the partial derivative of the value of yet another state, s″. We could continue this process, "unravelling" the recurrence further. Each time that we expand the partial derivative of the value of a state with respect to the parameters, we get another term. The first two terms that we have obtained are marked above. If we were to unravel the expression more times, by expanding ∂vθ(s″)/∂θ and then differentiating, we would obtain the subsequent third, fourth, etc., terms.

Finally, to get the desired result, we expand the start-state objective and take its derivative:

$$\frac{d}{d\theta} J(\theta) = \sum_{s \in \mathcal{S}} d_0(s)\, \frac{\partial}{\partial \theta} v^{\theta}(s). \qquad (7)$$

Combining the results from (6) and (7), we index each term by t, with the first term being t = 0, the second t = 1, etc., which results in the expression:

$$\frac{d}{d\theta} J(\theta) = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^{t} \Pr(S_t = s|\theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} q^{\theta}(s, a) \frac{\partial \pi^{\theta}(s, \alpha, a)}{\partial \theta}.$$

Notice that to get the gradient of J(θ), we have included a sum over all the states weighted by d0(s), the start-state probability. When t = 0, Pr(S0 = s | S0 = s̄, θ) is non-zero only when s = s̄ (at which point this probability is one), which allows us to succinctly represent all the terms. With this we conclude the proof.

B: Convergence

Lemma 2. Under Assumptions (A1)-(A3), the SAS policy gradient algorithm causes ∇J(θt) → 0 as t → ∞, with probability one.

Proof. Following the standard result on the convergence of gradient ascent (descent) methods (Bertsekas and Tsitsiklis 2000), we know that under Assumptions (A1)-(A3), either J(θ) → ∞ or ∇J(θ) → 0 as t → ∞. However, the maximum reward possible is Rmax and γ < 1; therefore J(θ) is bounded above by Rmax/(1 − γ). Hence J(θ) cannot go to ∞, and we get the desired result.


C: SAS Natural Policy Gradient

Property 1 (Fisher Information Matrix). For a policy parameterized using weights θ, let ψθ(s, α, a) := ∂ log πθ(s, α, a)/∂θ; then the Fisher information matrix is

$$F_{\theta} = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^{t} \Pr(S_t = s|\theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \psi(s, \alpha, a)\, \psi(s, \alpha, a)^{\top}.$$

Proof. To prove this result, we first note the following relation by Amari and Nagaoka (2007), which connects the Hessian and the FIM of a random variable X parameterized using θ:

$$\mathbf{E}\left[\frac{\partial^{2} \log \Pr(X)}{\partial \theta^{2}}\right] = -\mathbf{E}\left[\frac{\partial \log \Pr(X)}{\partial \theta}\, \frac{\partial \log \Pr(X)}{\partial \theta}^{\top}\right]. \qquad (8)$$

Now, let Tθ denote the random variable corresponding to the trajectories observed using policy πθ. Let τ = (s0, α0, a0, s1, α1, a1, ...) denote an outcome of Tθ; then the probability of observing this trajectory, τ, is given by

$$\begin{aligned}
\Pr(T_{\theta} = \tau) &= \Pr(s_0) \prod_{t=0}^{\infty} \Pr(\alpha_t | s_t)\, \Pr(a_t | s_t, \alpha_t)\, \Pr(s_{t+1} | s_t, a_t) \\
&= d_0(s_0) \prod_{t=0}^{\infty} \varphi(s_t, \alpha_t)\, \pi^{\theta}(s_t, \alpha_t, a_t)\, P(s_t, a_t, s_{t+1}).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\frac{\partial^{2}}{\partial \theta^{2}} \log \Pr(T_{\theta} = \tau) &= \frac{\partial^{2}}{\partial \theta^{2}} \log \left( d_0(s_0) \prod_{t=0}^{\infty} \varphi(s_t, \alpha_t)\, \pi^{\theta}(s_t, \alpha_t, a_t)\, P(s_t, a_t, s_{t+1}) \right) \\
&= \frac{\partial^{2}}{\partial \theta^{2}} \left( \log d_0(s_0) + \sum_{t=0}^{\infty} \log \varphi(s_t, \alpha_t) + \sum_{t=0}^{\infty} \log \pi^{\theta}(s_t, \alpha_t, a_t) + \sum_{t=0}^{\infty} \log P(s_t, a_t, s_{t+1}) \right) \\
&= \sum_{t=0}^{\infty} \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s_t, \alpha_t, a_t). \qquad (9)
\end{aligned}$$

We know that the Fisher information matrix for a random variable, which in our case is Tθ, is given by

$$\begin{aligned}
F_{\theta} &= \mathbf{E}\left[\frac{\partial \log \Pr(T_{\theta})}{\partial \theta}\, \frac{\partial \log \Pr(T_{\theta})}{\partial \theta}^{\top}\right] \\
&= -\mathbf{E}\left[\frac{\partial^{2} \log \Pr(T_{\theta})}{\partial \theta^{2}}\right] \qquad \text{(using Equation (8))} \\
&= -\mathbf{E}\left[\sum_{t=0}^{\infty} \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s_t, \alpha_t, a_t)\right] \qquad \text{(using Equation (9))} \\
&= -\sum_{\tau \in T_{\theta}} \Pr(T_{\theta} = \tau) \sum_{t=0}^{\infty} \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s_t, \alpha_t, a_t), \qquad (10)
\end{aligned}$$

where the summation over Tθ corresponds to all possible values of s, α, and a for every step t in the trajectory. Expanding the inner summation in (10),

$$F_{\theta} = -\sum_{\tau \in T_{\theta}} \Pr(T_{\theta} = \tau)\, \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s_0, \alpha_0, a_0) - \sum_{T_{\theta}} \Pr(T_{\theta})\, \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s_1, \alpha_1, a_1) - \ldots \qquad (11)$$

Note that the summation in (11) over all possible trajectories, i.e., all possible values of s, α, and a for every step t, marginalizes out the terms not associated with the respective log πθ terms, i.e.,

$$\begin{aligned}
F_{\theta} = &-\sum_{s_0 \in \mathcal{S}} \Pr(S_0 = s_0|\theta) \sum_{\alpha_0 \in 2^{\mathcal{B}}} \varphi(s_0, \alpha_0) \sum_{a_0 \in \alpha_0} \pi^{\theta}(s_0, \alpha_0, a_0)\, \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s_0, \alpha_0, a_0) \\
&-\sum_{s_1 \in \mathcal{S}} \Pr(S_1 = s_1|\theta) \sum_{\alpha_1 \in 2^{\mathcal{B}}} \varphi(s_1, \alpha_1) \sum_{a_1 \in \alpha_1} \pi^{\theta}(s_1, \alpha_1, a_1)\, \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s_1, \alpha_1, a_1) \\
&- \ldots \qquad (12)
\end{aligned}$$

Combining all the terms in (12) and discounting them appropriately with γ, we get

$$F_{\theta} = -\sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^{t} \Pr(S_t = s|\theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s, \alpha, a). \qquad (13)$$

Finally, note that using (8),

$$\sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \frac{\partial^{2}}{\partial \theta^{2}} \log \pi^{\theta}(s, \alpha, a) = -\sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \psi(s, \alpha, a)\, \psi(s, \alpha, a)^{\top}. \qquad (14)$$

Combining (13) and (14), we get

$$F_{\theta} = \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^{t} \Pr(S_t = s|\theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi^{\theta}(s, \alpha, a)\, \psi(s, \alpha, a)\, \psi(s, \alpha, a)^{\top}.$$

With this we conclude the proof.

Lemma 3 (SAS Natural Policy Gradient). Let w be a parameter such that,

∂wE

[1

2

∞∑t

γt(ψ(St,At, At)>w − qθ(St, At)

)2]= 0, (15)

then for all s ∈ S inM′,

∇J(θ) = w.

Proof. We begin by expanding (15),

$$
\begin{aligned}
\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \left(\psi_\theta(S_t, \mathcal{A}_t, A_t)^\top w - q_\theta(S_t, A_t)\right) \psi_\theta(S_t, \mathcal{A}_t, A_t)\right] &= 0 \\
\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, \psi_\theta(S_t, \mathcal{A}_t, A_t)\, \psi_\theta(S_t, \mathcal{A}_t, A_t)^\top w\right] &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, \psi_\theta(S_t, \mathcal{A}_t, A_t)\, q_\theta(S_t, A_t)\right]. \qquad (16)
\end{aligned}
$$

Further, using the definition of the natural gradient together with the SAS policy gradient expression for ∂J(θ)/∂θ,

$$
\begin{aligned}
\nabla J(\theta) := F_\theta^{-1}\, \frac{\partial J(\theta)}{\partial \theta}
&= F_\theta^{-1} \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^t \Pr(S_t = s \mid \theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} q_\theta(s, a)\, \frac{\partial \pi_\theta(s, \alpha, a)}{\partial \theta} \\
&= F_\theta^{-1} \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^t \Pr(S_t = s \mid \theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi_\theta(s, \alpha, a)\, \psi_\theta(s, \alpha, a)\, q_\theta(s, a). \qquad (17)
\end{aligned}
$$

Now, combining (16) and (17),

$$
\begin{aligned}
\nabla J(\theta) &= F_\theta^{-1} \sum_{t=0}^{\infty} \sum_{s \in \mathcal{S}} \gamma^t \Pr(S_t = s \mid \theta) \sum_{\alpha \in 2^{\mathcal{B}}} \varphi(s, \alpha) \sum_{a \in \alpha} \pi_\theta(s, \alpha, a)\, \psi_\theta(s, \alpha, a)\, \psi_\theta(s, \alpha, a)^\top w \\
&= F_\theta^{-1} F_\theta\, w \\
&= w,
\end{aligned}
$$

where the second-to-last step follows from Property 1. With this we conclude the proof.
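In practice, a w satisfying (15) can be obtained from a batch of samples either by stochastic gradient steps on the squared error (as Algorithm 2 in Appendix E does) or by solving the corresponding weighted least-squares problem in closed form. The following sketch shows the latter under our own variable names, with Monte-Carlo returns standing in for q_θ; the ridge term is an assumption we add purely for numerical stability.

```python
import numpy as np

def natural_gradient_weights(psi, q_hat, t, gamma=0.99, ridge=1e-6):
    """Closed-form minimizer of the weighted squared error in (15).

    psi   : (N, d) score vectors psi_theta(s, alpha, a) for sampled tuples
    q_hat : (N,)   estimates of q_theta(s, a), e.g. Monte-Carlo returns G_t
    t     : (N,)   time step of each sample, used for the gamma^t weights
    """
    w_t = gamma ** np.asarray(t)                        # per-sample discounts
    M = (psi * w_t[:, None]).T @ psi                    # ~ sum of gamma^t psi psi^T
    b = (psi * w_t[:, None]).T @ q_hat                  # ~ sum of gamma^t psi q
    # Ridge term added only so the solve stays well conditioned on small batches.
    w = np.linalg.solve(M + ridge * np.eye(M.shape[0]), b)
    return w                                            # by Lemma 3, w is the natural gradient
```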

D: Adaptive Variance Mitigation

Property 2 (Unbiased estimator). Let Ĵ(s, α, a, θ) := ψθ(s, α, a)(qθ(s, a) + λ1 v(s) + λ2 q(s, α)) and dπ(s) := (1 − γ) ∑_{t=0}^{∞} γ^t Pr(St = s). Then, for any values of λ1 ∈ ℝ and λ2 ∈ ℝ,

$$
\nabla J(\theta) = \mathbb{E}\left[\hat{J}(s, \alpha, a, \theta) \,\middle|\, d^\pi, \varphi, \pi\right].
$$


Proof. We begin by expanding the expectation of Ĵ,

$$
\mathbb{E}\left[\hat{J}(s, \alpha, a, \theta) \,\middle|\, d^\pi, \varphi, \pi\right]
= \mathbb{E}\left[\psi_\theta(s, \alpha, a)\, q_\theta(s, a)\right] + \mathbb{E}\left[\psi_\theta(s, \alpha, a)\left(\lambda_1 v(s) + \lambda_2 q(s, \alpha)\right)\right].
$$

The first term is the SAS policy gradient ∇J(θ), so it remains to show that the baseline term is zero.

Now consider the term associated with the baselines v(s) and q(s, α),

$$
\begin{aligned}
\mathbb{E}\left[\psi_\theta(s, \alpha, a)\left(\lambda_1 v(s) + \lambda_2 q(s, \alpha)\right)\right]
&= \sum_{s \in \mathcal{S},\, \alpha \in 2^{\mathcal{B}}} \Pr(s, \alpha) \sum_{a \in \alpha} \pi_\theta(s, \alpha, a)\, \frac{\partial \ln \pi_\theta(s, \alpha, a)}{\partial \theta}\left(\lambda_1 v(s) + \lambda_2 q(s, \alpha)\right) \\
&= \sum_{s \in \mathcal{S},\, \alpha \in 2^{\mathcal{B}}} \Pr(s, \alpha)\left(\lambda_1 v(s) + \lambda_2 q(s, \alpha)\right) \sum_{a \in \alpha} \pi_\theta(s, \alpha, a)\, \frac{\partial \ln \pi_\theta(s, \alpha, a)}{\partial \theta}. \qquad (18)
\end{aligned}
$$

Focusing only on the innermost summation in (18),

$$
\begin{aligned}
\sum_{a \in \alpha} \pi_\theta(s, \alpha, a)\, \frac{\partial \ln \pi_\theta(s, \alpha, a)}{\partial \theta}
&= \sum_{a \in \alpha} \pi_\theta(s, \alpha, a)\, \frac{1}{\pi_\theta(s, \alpha, a)}\, \frac{\partial \pi_\theta(s, \alpha, a)}{\partial \theta} \\
&= \sum_{a \in \alpha} \frac{\partial \pi_\theta(s, \alpha, a)}{\partial \theta}
= \frac{\partial}{\partial \theta} \sum_{a \in \alpha} \pi_\theta(s, \alpha, a)
= \frac{\partial}{\partial \theta}\, 1
= 0. \qquad (19)
\end{aligned}
$$

Combining (18) and (19), we observe that the bias of this new baseline combination is zero and we get the desired result.
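Equation (19) is also easy to verify numerically for the masked softmax parameterization used in the experiments (Appendix F): for any state, available-action set, and baseline value, the score-weighted baseline term vanishes. Below is a small, self-contained sanity check of our own construction, with gradients taken with respect to the action scores rather than the full parameter vector.

```python
import numpy as np

def baseline_term_is_zero(scores, mask, baseline=3.7):
    """Checks sum_a pi(a) * d(log pi(a))/dy * baseline == 0 for a masked softmax.

    scores : (k,) action scores y;  mask : (k,) 1 if the action is in alpha.
    Rows for unavailable actions are multiplied by pi = 0, so their (undefined)
    log-gradients never contribute to the sum.
    """
    z = np.where(mask == 1, np.exp(scores), 0.0)
    pi = z / z.sum()                                  # masked softmax probabilities
    grad_log_pi = np.eye(len(scores)) - pi            # row a: d(log pi(a))/dy = e_a - pi
    term = (pi[:, None] * grad_log_pi * baseline).sum(axis=0)
    return np.allclose(term, 0.0)

print(baseline_term_is_zero(np.array([0.2, 1.5, -0.3, 0.7]),
                            np.array([1, 0, 1, 1])))  # -> True, for any baseline value
```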

Lemma 4 (Adaptive variance mitigation). Let

$$
A = [\lambda_1, \lambda_2]^\top, \qquad
B = \big[\psi_\theta(s, \alpha, a)\, v(s),\;\; \psi_\theta(s, \alpha, a)\, q(s, \alpha)\big], \qquad
C = \psi_\theta(s, \alpha, a)\, q_\theta(s, a),
$$

such that A ∈ ℝ^{2×1}, B ∈ ℝ^{d×2}, and C ∈ ℝ^{d×1}. Then the A that minimizes the variance of Ĵ is given by,

$$
A = -\left(\mathbb{E}\left[B^\top B\right]\right)^{-1} \mathbb{E}\left[B^\top C\right].
$$

Proof. Let the sample estimate of the gradient be given by,

$$
\hat{J}(\theta) := \hat{J}(s, \alpha, a, \theta) = \psi_\theta(s, \alpha, a)\left(q_\theta(s, a) + \lambda_1 v(s) + \lambda_2 q(s, \alpha)\right).
$$

We aim to find the values of λ1 and λ2 (equivalently, of A) that minimize the variance of this estimator, i.e.,

$$
A = \underset{A}{\arg\min}\; \operatorname{var}\!\left(\hat{J}(\theta)\right).
$$

The variance of the estimator can be computed as follows,

$$
\begin{aligned}
\operatorname{var}\!\left(\hat{J}(\theta)\right)
&= \mathbb{E}\left[\left(\hat{J}(\theta) - \mathbb{E}\left[\hat{J}(\theta)\right]\right)^{\top}\left(\hat{J}(\theta) - \mathbb{E}\left[\hat{J}(\theta)\right]\right)\right] \\
&= \mathbb{E}\left[\hat{J}(\theta)^\top \hat{J}(\theta)\right] - 2\,\mathbb{E}\left[\hat{J}(\theta)^\top \mathbb{E}\left[\hat{J}(\theta)\right]\right] + \mathbb{E}\left[\hat{J}(\theta)\right]^{\top} \mathbb{E}\left[\hat{J}(\theta)\right] \\
&= \mathbb{E}\left[\hat{J}(\theta)^\top \hat{J}(\theta)\right] - \mathbb{E}\left[\hat{J}(\theta)\right]^{\top} \mathbb{E}\left[\hat{J}(\theta)\right]. \qquad (20)
\end{aligned}
$$

From Property 2 we know that,

$$
\mathbb{E}\left[\hat{J}(\theta)\right]
= \mathbb{E}\left[\psi_\theta(s, \alpha, a)\left(q_\theta(s, a) + \lambda_1 v(s) + \lambda_2 q(s, \alpha)\right)\right]
= \mathbb{E}\left[\psi_\theta(s, \alpha, a)\, q_\theta(s, a)\right] + 0
= \mathbb{E}[C].
$$


Expanding (20) in matrix notation, using Ĵ(θ) = C + BA,

$$
\begin{aligned}
\operatorname{var}\!\left(\hat{J}(\theta)\right)
&= \mathbb{E}\left[\left(C + BA\right)^{\top}\left(C + BA\right)\right] - \mathbb{E}[C]^\top \mathbb{E}[C] \\
&= \mathbb{E}\left[C^\top C\right] + \mathbb{E}\left[C^\top B A\right] + \mathbb{E}\left[A^\top B^\top C\right] + \mathbb{E}\left[A^\top B^\top B A\right] - \mathbb{E}[C]^\top \mathbb{E}[C]. \qquad (21)
\end{aligned}
$$

Since the first and last terms in (21) are independent of A, they do not affect the optimization. The remaining terms that matter are,

$$
\mathbb{E}\left[C^\top B A\right] + \mathbb{E}\left[A^\top B^\top C\right] + \mathbb{E}\left[A^\top B^\top B A\right].
$$

Differentiating these terms with respect to A and equating the result to 0, we get,

$$
\begin{aligned}
2\,\mathbb{E}\left[B^\top C\right] + 2\,\mathbb{E}\left[B^\top B\right] A &= 0 \\
\mathbb{E}\left[B^\top B\right] A &= -\,\mathbb{E}\left[B^\top C\right] \\
A &= -\left(\mathbb{E}\left[B^\top B\right]\right)^{-1} \mathbb{E}\left[B^\top C\right].
\end{aligned}
$$

With this we conclude the proof.
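With a batch of samples, the expectations in Lemma 4 can be replaced by sample averages, giving a closed-form value for [λ1, λ2]. The sketch below is a rough illustration under our own variable names; the paper instead adapts λ gradually (with the rate η_λ mentioned in Appendix F), and the small ridge term is our addition to keep the 2×2 solve well conditioned.

```python
import numpy as np

def adaptive_lambdas(psi, v, q_alpha, q_sa, ridge=1e-6):
    """Sample-average version of A = -(E[B^T B])^{-1} E[B^T C] from Lemma 4.

    psi     : (N, d) score vectors psi_theta(s, alpha, a)
    v       : (N,)   state baseline v(s)
    q_alpha : (N,)   action-set baseline q(s, alpha)
    q_sa    : (N,)   critic/return estimates of q_theta(s, a)
    """
    B = np.stack([psi * v[:, None], psi * q_alpha[:, None]], axis=2)   # (N, d, 2)
    C = psi * q_sa[:, None]                                            # (N, d)
    BtB = np.einsum('ndi,ndj->ij', B, B) / len(psi)                    # ~ E[B^T B], (2, 2)
    BtC = np.einsum('ndi,nd->i', B, C) / len(psi)                      # ~ E[B^T C], (2,)
    # Ridge term keeps the 2x2 solve well conditioned on small batches.
    return -np.linalg.solve(BtB + ridge * np.eye(2), BtC)              # [lambda_1, lambda_2]
```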

E: SAS Natural Policy Gradient

Pseudo-code for the SAS natural policy gradient is provided in Algorithm 2. Let the learning rates for updating θ and w be ηθ and ηw, respectively. As in Algorithm 1, we first collect the transition batch B and compute the sampled returns from each state in Lines 2 and 3. Following Lemma 3, we update the parameter w in Line 5 to minimize its associated TD error. The updated parameter w is then used to update the policy parameters θ. As dividing by a scalar does not change the direction of the (natural) gradient, we normalize the update by the norm of w in Line 6 for better stability.

Algorithm 2: Stochastic Action Set Natural Policy Gradient (SAS-NPG)
1  for episode = 0, 1, 2, ... do
       # Collect transition batch using π_θ
2      B = {(s_0, α_0, a_0, r_0), ..., (s_T, α_T, a_T, r_T)}
3      G(s_t) = ∑_{k=0}^{T−t} γ^k r_{t+k}
       # Perform batch update on parameters
4      ψ_θ(s, α, a) = ∂ log π_θ(s, α, a) / ∂θ
5      w ← w + η_w (G(s) − ψ_θ(s, α, a)^⊤ w) ψ_θ(s, α, a)        ▷ Update w
6      θ ← θ + η_θ w / ‖w‖                                        ▷ Update π_θ
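A rough rendering of the batch update in Lines 4-6 is given below, assuming the batch has already been collected and the returns G(s_t) computed. The variable names are ours, and averaging the w update over the batch is an assumption, since Algorithm 2 leaves the exact batching unspecified.

```python
import numpy as np

def sas_npg_update(theta, w, psi, G, eta_theta=5e-3, eta_w=1e-2):
    """One batch update mirroring Lines 4-6 of Algorithm 2 (a sketch).

    theta : (d,)   flattened policy parameters
    w     : (d,)   natural-gradient weights (compatible with psi_theta)
    psi   : (N, d) score vectors for the sampled (s, alpha, a) tuples
    G     : (N,)   Monte-Carlo returns G(s_t) from Line 3
    """
    # Line 5: gradient step on the squared error of Lemma 3, averaged over the batch.
    td_err = G - psi @ w                          # G(s) - psi_theta^T w
    w = w + eta_w * (psi * td_err[:, None]).mean(axis=0)
    # Line 6: move theta along the normalized natural-gradient estimate w.
    theta = theta + eta_theta * w / (np.linalg.norm(w) + 1e-8)
    return theta, w
```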

F: Empirical Analysis Details

Implementation details

Policy parameterization. To make the policy handle stochastic action sets, we make use of a mask that indicates the available actions. Formally, let φ(s) ∈ ℝ^d be the feature vector of the state and let θ ∈ ℝ^{d×|B|} denote the parameters that project the features onto the space of all actions. Let y := φ(s)^⊤θ denote the scores for each action, and let 1{a∈α} be the indicator variable denoting whether the action a is in the available action set α. The probability of choosing an action is then computed using a masked softmax, i.e.,

$$
\pi_\theta(s, \alpha, a) := \frac{\exp(y_a) \cdot \mathbb{1}_{\{a \in \alpha\}}}{\sum_{a' \in \alpha} \exp(y_{a'}) \cdot \mathbb{1}_{\{a' \in \alpha\}}},
$$

where y_a corresponds to the score of action a in y.
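A minimal sketch of this masked softmax follows (our own rendering with a linear scorer; the experiments described below use single-layer neural networks to produce the scores):

```python
import numpy as np

def masked_softmax_policy(theta, phi_s, avail_mask):
    """pi_theta(s, alpha, .) for a linear scorer y = phi(s)^T theta.

    theta      : (d, |B|)  parameters projecting features onto all base actions
    phi_s      : (d,)      state features phi(s)
    avail_mask : (|B|,)    1 if the action is in alpha, else 0
    """
    y = phi_s @ theta                               # scores for every base action
    exp_y = np.exp(y - y.max()) * avail_mask        # zero out unavailable actions
    return exp_y / exp_y.sum()                      # probabilities over alpha only

# Example: 3 base actions, action 1 unavailable at this step.
rng = np.random.default_rng(0)
pi = masked_softmax_policy(rng.normal(size=(4, 3)), rng.normal(size=4),
                           np.array([1.0, 0.0, 1.0]))
print(pi)   # the unavailable action gets probability exactly 0
```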


Figure 4: (Left) An illustration of the top view of the maze domain, where the red dot corresponds to the agent and the green arrows around it represent the actions. The star represents the goal position. (Right) Map view of the San Francisco Bay Area. We consider a road network similar to the one used by Boutilier et al. (2018).

Hyperparameter settings. For the maze domain, state features were represented using a 3rd-order coupled Fourier basis (Konidaris, Osentoski, and Thomas 2011). For the San Francisco map domain, a one-hot encoding was used to represent each of the nodes (states) in the road network. For the recommender system domain, the user context provided by the environment was used directly as the state features. Using these features, single-layer neural networks were used to represent the policy, the baselines, and the q-function for all algorithms, for all domains. The discounting parameter γ was set to 0.99 for all domains.

For SAS policy gradient, the learning rates for both baselines were searched over [1e−4, 1e−2]. The learning rate for the policy was searched over [5e−5, 5e−3]. The hyperparameter ηλ was kept fixed at 0.999 throughout. For SAS natural policy gradient, the learning rate ηw was searched over [1e−4, 1e−2].

For the SAS-Q-learning baseline, the exploration parameter for ε-greedy was searched over [0.05, 0.15], and the learning rate for the q-function was searched over [1e−4, 1e−2]. To encompass both online and batch learning for SAS-Q-learning, an additional hyperparameter search was done over the batch sizes {1, 8, 16} and the number of batches {1, 8, 16} per update to the q-function. Note that when both the batch size and the number of batches are 1, it becomes the online version (Boutilier et al. 2018).

In total, 1000 settings for each algorithm and each domain were uniformly sampled from the hyperparameter ranges/sets mentioned above. Results from the best-performing setting are reported in all plots. Each hyperparameter setting was run with 30 different seeds to obtain the standard deviation of the performance.
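For concreteness, the random search over these ranges can be mimicked as below. The paper only states the ranges and that settings were sampled uniformly; whether learning rates were drawn in log space is not specified, so the log-uniform sampling here is our assumption, and all dictionary keys are our own names.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sas_pg_setting():
    """Draw one SAS-PG hyperparameter setting from the ranges listed above (a sketch)."""
    log_uniform = lambda lo, hi: 10 ** rng.uniform(np.log10(lo), np.log10(hi))
    return {
        "lr_baselines": log_uniform(1e-4, 1e-2),   # both baselines share this range
        "lr_policy":    log_uniform(5e-5, 5e-3),
        "eta_lambda":   0.999,                     # kept fixed
        "gamma":        0.99,
    }

settings = [sample_sas_pg_setting() for _ in range(1000)]   # 1000 settings per algorithm/domain
```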

Additional Experimental Results

In Figures 5 and 6 we report the learning curves and the adapted λ1 and λ2 values for all the domains under different probabilities of action availability.


Figure 5: Best-performing learning curves for different settings. (Left to right) San Francisco Map domain, Maze domain, and the Recommender System domain. (Top to bottom) Probability of any action being available in the action set, ranging from 0.8 to 0.2, for the respective domains. Shaded regions correspond to one standard deviation and were obtained using 30 trials.


Figure 6: Autonomously adapted values of λ1 and λ2 (associated with v and q, respectively) for the best-performing SAS-PG instance on different settings. (Left to right) San Francisco Map domain, Maze domain, and the Recommender System domain. (Top to bottom) Probability of any action being available in the action set, ranging from 0.8 to 0.2, for the respective domains. Shaded regions correspond to one standard deviation and were obtained using 30 trials.