
Bayesian Exploration and Interactive Demonstration in Continuous State MAXQ-Learning

Kathrin Gräve and Sven Behnke

Abstract— Deploying robots for service tasks requires learning algorithms that scale to the combinatorial complexity of our daily environment. Inspired by the way humans decompose complex tasks, hierarchical methods for robot learning have attracted significant interest. In this paper, we apply the MAXQ method for hierarchical reinforcement learning to continuous state spaces. By using Gaussian Process Regression for MAXQ value function decomposition, we obtain probabilistic estimates of primitive and completion values for every subtask within the MAXQ hierarchy. From these, we recursively compute probabilistic estimates of state-action values. Based on the expected deviation of these estimates, we devise a Bayesian exploration strategy that balances optimization of expected values and risk from exploring unknown actions. To further reduce risk and to accelerate learning, we complement MAXQ with learning from demonstrations in an interactive way. In every situation and subtask, the system may ask for a demonstration if there is not enough knowledge available to determine a safe action for exploration. We demonstrate the ability of the proposed system to efficiently learn solutions to complex tasks on a box stacking scenario.

I. INTRODUCTION

One of the long-standing goals of robotics research is the eventual deployment of robots to tasks in our everyday life. Given the complexity of our environment and the way it is constantly changing, this requires learning methods that are able to efficiently learn complex task policies in large state spaces. Hierarchical methods are a promising approach to scale robot learning to these challenges. Recognizing the fact that complex tasks usually exhibit hierarchical structure, hierarchical learning methods employ various kinds of abstraction to decompose tasks into smaller units of reduced complexity. For instance, repetitive sequences of actions can often be aggregated to macro operators, allowing complex tasks to be solved in terms of these more abstract actions, as illustrated in Fig. 1. Reducing the number of steps and choices on higher levels in this way may greatly accelerate learning. This kind of abstraction is often referred to as temporal abstraction. Another way to reduce the complexity of tasks is to identify dimensions of the state space that are irrelevant to some of the subtasks. Subtasks may also be reused from different contexts in order to efficiently exploit existing knowledge. These strategies are known as state abstraction and subtask sharing [1]. Dietterich proposed the MAXQ value function decomposition [2]—a framework for hierarchical reinforcement learning that successfully leverages these kinds of abstraction in discrete settings.

The authors are with the Autonomous Intelligent Systems group, Department of Computer Science, University of Bonn, Germany. [email protected] This work was supported by the B-IT Research School.

IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, May 2014.

Fig. 1. Hierarchical decomposition of the sample task of moving a stack of boxes, from the whole task on the top level via subtasks for moving partial stacks to movement primitives on the bottom level.

In this work, we apply the MAXQ framework to continuous state spaces—a typical characteristic of practical tasks—and complement reinforcement learning with learning from demonstrations in an integrated, interactive approach. We employ Gaussian Process Regression (GPR) [3] to obtain approximations of the value function for primitive actions and the completion functions of composite actions of every subtask of a hierarchical task. From these, we compute probabilistic estimates of state-action values by recursively aggregating estimates of completion values of composite tasks and values of primitive tasks throughout a subtask hierarchy. These estimates are central to our approach since their associated uncertainties express the confidence of the system in the outcome of actions given the available knowledge. Based on the estimated values and their uncertainties, we derive a Bayesian exploration criterion that makes informed decisions on promising actions instead of performing arbitrary random exploration. This drastically reduces the number of trials needed to find rewarding actions.

Besides performance with respect to obtained rewards and number of trials, safety is another major concern in practical applications. A successful learning algorithm must avoid exploring actions with unpredictable outcomes. We therefore incorporate a trade-off in our exploration strategy that uses expected deviation [4] to balance optimization of reward and risk from executing actions with uncertain outcome.

The efficiency of our exploration approach is tied to the availability of prior experience to guide it. Rather than bootstrapping the system, we propose an interactive approach, allowing the system to request a demonstration of the current subtask from a human expert if the available knowledge is insufficient. Complementing reinforcement learning with learning from demonstrations in this way provides functional priors that accelerate reinforcement learning by narrowing its search space. On the other hand, the human effort of providing demonstrations is reduced by iteratively choosing the most appropriate learning method for every subtask on every level of abstraction of a hierarchical task. Hence, demonstrations are limited to specific subtasks at specific levels of abstraction, whereas autonomous exploration of solutions is preferred whenever possible. Finally, in combination with our Bayesian exploration criterion, the set of human demonstrations acts as a behavioral prior that biases the robot's behavior towards actions that are predictable for a human collaborator. This is advantageous in applications where robots are to collaborate closely with humans.

II. RELATED WORK

Over the past decade, many researchers have investigated ways to exploit hierarchical structure to scale reinforcement learning to complex robotics applications [5].

One direction of research explores hierarchical policy search methods that optimize the long-term reward of parameterized policies. Daniel et al. [6] proposed a hierarchical extension of the Relative Entropy Policy Search (REPS) [7] method to learn sequences of parameterized movement primitives. The sequence learning problem is formulated as a constrained optimization problem on the parameters of a high-level gating network responsible for action selection and the parameters of low-level motion primitives. Stulp and Schaal [8] extend the PI² algorithm [9] to simultaneously optimize the intermediate goals of a sequence of movement primitives and their shape parameters. Both approaches allow for continuous state and action spaces by using parameterized dynamical systems to represent primitive actions and perform supervised learning from demonstrations to bootstrap parameters. In both cases, though, the length of an action sequence has to be known in advance, limiting the flexibility of the system to select optimal action sequences.

Konidaris et al. [10] developed a more general approach where trajectories are first segmented and then merged to skill trees by considering their statistical similarities in reverse order. Sequences may be trained either by demonstrating viable solutions or by explorative reinforcement learning. Similar to our approach, skill trees encode a deterministic task policy that does not require planning at runtime.

The MAXQ method presents an alternative, value function-based approach that allows reinforcement learning agents to benefit from state abstractions. Several extensions to the original MAXQ method have been proposed to improve scalability and performance by integrating prior knowledge.

One direction of research investigates ways to make more efficient use of gathered experiences by integrating models of the environment and the reward structure into the MAXQ framework. Cao et al. [11] augmented the MAXQ hierarchy with Bayesian priors on distributions over primitive reward and environment models. Learning then proceeds in a mixed model-based/model-free fashion. Every action on the primitive level leads to an update of the posterior over model parameters. At fixed intervals, new primitive model parameters are sampled and the reward and completion values in the MAXQ hierarchy are updated recursively, using the sampled models. Our approach is different in that it keeps the original model-free structure of MAXQ and proposes a Bayesian exploration strategy. Expert knowledge is incorporated in the form of demonstrated task solutions rather than priors on model distributions. Jong and Stone [12] proposed a hierarchical decomposition of reward and transition models for all subtasks of a hierarchy and extended MAXQ with the optimistic exploration of the model-based R-MAX [13] algorithm. The resulting algorithm, fitted R-MAXQ, uses instance-based function approximation to generalize model parameters to unobserved state-action pairs and fitted value iteration [14] to learn policies for subtasks on continuous state and action spaces. Whereas the exploration strategy of R-MAX favors unknown state-action pairs, we argue in this work that an exploration strategy which balances optimization of reward with risk from executing unknown actions is preferable in real-world applications.

Bai et al. [15] apply MAXQ to continuous state and action spaces by avoiding an explicit representation of completion or value functions and, instead, propose an online algorithm that recursively computes values of visited states according to the MAXQ decomposition. To render this process feasible, the proposed MAXQ-OP algorithm maintains goal state distributions for subtasks and approximates the completion function based on a sampled subset of the possible goal states of a subtask. The performance of the approach largely depends on the availability of domain knowledge to specify a prior distribution of goal states for subtasks. In contrast, our approach uses task demonstrations to convey expert knowledge. By using instance-based generalization of information from visited states, our approach also avoids precomputing exhaustive policies, similar to an online approach. However, in contrast to the ad-hoc computations performed by the approach of Bai et al., our system remembers observed instances to improve future decisions.

III. PROPOSED METHOD

In this work, we propose an integrated approach to hierarchical robot learning in complex domains, based on the MAXQ framework for hierarchical reinforcement learning. Our first contribution is the application of MAXQ to continuous state spaces using Gaussian Process (GP) approximations of the completion and value functions of composite and primitive tasks, respectively. From these, we compute estimates of state-action values by recursively aggregating GP estimates and associated uncertainties throughout subtask hierarchies. For reinforcement learning, we derive a Bayesian exploration criterion based on expected deviation that balances optimization of long-term rewards against potential risks to the robot from executing actions with unpredictable outcome, enabling efficient and safe learning. Secondly, we complement MAXQ with learning from demonstrations by explicitly allowing the proposed system to ask for human expert knowledge if the previously gathered experiences do not suffice to confidently propose an action. The decision for either way of learning is taken incrementally in every encountered situation and for all subtasks within the hierarchy. This leads to a coherent system where the same logic is applied to all subtasks on every level of abstraction. By leveraging the strengths of both reinforcement learning and learning from demonstrations, our combined approach is able to solve tasks in large state spaces more quickly than either learning method alone and with little involvement on the human expert's side.

A. Continuous MAXQ Learning with Uncertainties

The MAXQ framework has been proposed by Dietterich [2] to learn recursively optimal hierarchical policies for (Semi-)Markov Decision Processes. Its key idea is to recursively decompose the value V^π(i, s) of state s in subtask i, given a fixed policy π, into the sum of the value V^π(a, s) of the subtask a chosen according to π_i(s) and the cumulative reward C^π(i, s, a) for completing i after finishing a. In this context, C^π(i, s, a) is called the completion function of subtask i in situation s for action a. Following the notation of [2], this can be expressed as follows:

$$Q^\pi(i, s, a) = V^\pi(a, s) + C^\pi(i, s, a),$$
$$V^\pi(i, s) = \begin{cases} Q^\pi(i, s, \pi_i(s)) & \text{if } i \text{ is composite,} \\ \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is primitive,} \end{cases} \tag{1}$$

where R is the single-step reward received if executing a primitive action i in state s leads to a transition to state s′, and P is the probability of that transition. One of the major differences between MAXQ and other methods for hierarchical reinforcement learning is that the value functions of subtasks in MAXQ are context-free, i.e., they do not depend on the context their subtask was invoked from. This allows subtask policies to be reused in different task contexts and makes it possible to benefit from state abstraction.
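To make the recursion in Eq. (1) concrete, the following minimal Python sketch evaluates values over a subtask hierarchy. It assumes a hypothetical Subtask interface (an is_primitive flag, a children(s) method listing admissible child actions, a learned completion(s, a) model, and, for primitives, a value(s) estimate of the expected one-step reward); it is illustrative and not the authors' implementation.

```python
# Sketch of the MAXQ value decomposition of Eq. (1) over a subtask hierarchy.
# `task` objects follow the hypothetical interface described above.

def q_value(task, s, a):
    """Q(i, s, a) = V(a, s) + C(i, s, a)."""
    return v_value(a, s) + task.completion(s, a)

def v_value(task, s):
    """V(i, s): expected one-step reward for primitives, recursion for composites."""
    if task.is_primitive:
        return task.value(s)   # approximates the sum over s' of P(s'|s,i) R(s'|s,i)
    # Greedy evaluation: the fixed policy pi_i(s) is replaced by a max operator,
    # as done when unfolding Eq. (1) during learning (see below).
    return max(q_value(task, s, a) for a in task.children(s))
```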

Based on this decomposition, Dietterich extended Q-Learning [16] to the MAXQ hierarchy and proved its convergence to a recursively optimal policy. For every step of a primitive task, the value function of the task is updated using the experienced reward. Once a subtask is completed, the completion function of its parent is updated with the estimated value of the new state in the parent task. Computing this estimate involves recursively unfolding Eq. (1), replacing the fixed policy with a max operator that selects the best action in every step. To avoid the hierarchical credit assignment problem during learning, the designer of a MAXQ hierarchy may assign different kinds of rewards to subtasks on different levels.

Learning a hierarchical policy with the MAXQ framework requires maintaining the completion values of all composite subtasks and the value functions of all primitive tasks. In continuous domains, explicitly representing these quantities for all states is infeasible. At the same time, in practical applications, a reinforcement learning agent will only ever visit a small part of the state and action spaces. A common technique is therefore to represent the value function using function approximation. In this paper, we investigate this idea in the context of hierarchical reinforcement learning by applying GPR to the MAXQ learning algorithm.

GPR is a technique for non-parametric Bayesian regression which yields predictions, with associated uncertainties, of function values at unseen inputs. For every subtask a on every level i, we propose to maintain a GP approximation of the completion function C(i, s, a) or, in the case of primitive actions, of the value function V(s). Estimates of completion values and of values of primitive actions can then be obtained in the form of predictions based on a prior on the approximated function and a set of gathered training samples.
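As one possible realization of these per-subtask approximators, the sketch below uses scikit-learn's Gaussian process regression; the library choice, the class name, and the RBF kernel width are assumptions made for illustration and are not prescribed by the paper. The estimates referred to in the next paragraph would be obtained from such predictors.

```python
# One GP approximator per subtask: inputs are numeric feature vectors encoding a
# state (for primitive values) or a state-action pair (for completion values),
# targets are the observed values. Sketch only; refitting after every sample is
# the simplest choice, not the most efficient one.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

class GPApproximator:
    def __init__(self, length_scale=0.35):          # kernel width is task-specific
        self._gp = GaussianProcessRegressor(kernel=RBF(length_scale=length_scale),
                                            normalize_y=True)
        self._X, self._y = [], []

    def add_sample(self, x, target):
        self._X.append(np.asarray(x, dtype=float))
        self._y.append(float(target))
        self._gp.fit(np.vstack(self._X), np.asarray(self._y))

    def predict(self, x):
        """Return (mean, variance) of the approximated function at x."""
        mean, std = self._gp.predict(np.atleast_2d(np.asarray(x, dtype=float)),
                                     return_std=True)
        return float(mean[0]), float(std[0]) ** 2
```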

From these, we compute an estimate of the value Q(i, s, a) by recursively aggregating estimates and their uncertainties throughout a subtask hierarchy, analogous to Eq. (1):

$$Q(i, s, a) \sim \mathcal{N}\!\left(\mu_Q,\, \sigma_Q^2\right) = \mathcal{N}\!\left(\mu_{V(a,s)} + \mu_{C(i,s,a)},\ \sigma_{V(a,s)}^2 + \sigma_{C(i,s,a)}^2\right). \tag{2}$$

Here, V(a, s) decomposes recursively into the sum of normally distributed GP estimates for completion values and values of primitive actions, as described by Eq. (1). Due to the generalization abilities of the Gaussian Process approximation, the accuracy of the estimates increases as the system gathers experience from interacting with the environment.
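A minimal sketch of this recursive aggregation is given below. It reuses the hypothetical per-subtask approximators from the previous sketch; the attribute names gp_completion and gp_value, the children(s) and is_primitive members, and the helper features(s, a) (a numeric encoding of a state-action pair) are all assumptions for illustration.

```python
# Recursive probabilistic estimate of Q(i, s, a) as in Eq. (2).

def q_estimate(task, s, a):
    """Return (mean, variance) of Q(i, s, a) = V(a, s) + C(i, s, a)."""
    mu_v, var_v = v_estimate(a, s)
    mu_c, var_c = task.gp_completion.predict(features(s, a))
    # For a sum of (assumed independent) Gaussian estimates, means and variances add.
    return mu_v + mu_c, var_v + var_c

def v_estimate(task, s):
    """Return (mean, variance) of V(i, s), following the greedy recursion of Eq. (1)."""
    if task.is_primitive:
        return task.gp_value.predict(s)   # states assumed to be numeric vectors
    # Choose the child action with the highest expected value and keep its uncertainty.
    return max((q_estimate(task, s, a) for a in task.children(s)),
               key=lambda est: est[0])
```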

B. Expected Deviation

In practical applications, obtaining samples for reinforcement learning is often expensive. It is therefore important that learning algorithms exploit information efficiently and attain an effective exploration policy with a minimal number of trials. Bayesian Reinforcement Learning [17] allows making an informed decision for a promising next action by considering the uncertainty associated with Bayesian estimates of achievable performance. While the evolution of total accumulated reward over time is a relevant measure for the theoretical analysis of algorithms and for performance comparisons, practical applications also need to consider safety when choosing actions. Greedy optimization of rewards may easily lead to the exploration of actions with fatal consequences for the robot or its environment.

In previous work, we have introduced an exploration strategy based on expected deviation [4]. It allows for safe, yet data-efficient learning by balancing the expected improvement [18] and degradation of candidate actions. The expected improvement ED⊕ equals the expected value of the predicted improvement in terms of value of a candidate action over the value Qbest of the best known action. Similarly, the expected degradation ED⊖ measures how much the value of a candidate action is expected to fall short of Qbest. Both quantities are computed directly from the distribution of the estimated state-action value Q(x) of a candidate input.


Fig. 2. Strategy the system has to learn in order to relocate a stack of three boxes without collapsing it. It involves splitting the stack by moving the top boxes to a temporary location (yellow). In a second step, the remaining stub may be moved to the target location (green), where it will be joined with the upper boxes in a third step. The task is complicated by a static obstacle (gray box) in the center of the workspace.

Incorporating a trade-off with the expected degradation into the optimization of the expected improvement fosters safety since it depreciates actions that seem promising because a large uncertainty in their value estimate leaves an opportunity for large improvement.

In this work, we apply Bayesian expected deviation learning to MAXQ hierarchical reinforcement learning. In MAXQ, state-action values are not represented explicitly but stored implicitly as completion values and primitive state values, which are approximated by Gaussian Processes in our approach. From these, we recursively compute a probabilistic estimate of the value of candidate state-action pairs using Eq. (2), which is needed to compute the expected improvement and degradation. The resulting objective function on the expected deviation ED is defined as:

$$ED(x) := ED^{\oplus}(x, Q_{best}) - f\!\left(ED^{\ominus}, Q_{best}\right), \tag{3}$$
$$f\!\left(ED^{\ominus}, Q_{best}\right) = \left(Q_{max} - \left(Q_{best} - ED^{\ominus}(x, Q_{best})\right)\right)^2,$$

where the maximum achievable value Qmax—which is usually known at design time—is used as a normalizer.

Depending on the structure of the space of actions, different strategies may be used to optimize this quantity. Note that optimization of the expected deviation does not involve sampling rewards from the environment. Rather, the value of actions is predicted efficiently using the GP approximation.
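For a Gaussian estimate Q(x) ~ N(μ_Q, σ_Q²) obtained from Eq. (2), the expected improvement and the expected degradation have the standard closed forms known from Bayesian optimization. The sketch below combines them with the trade-off term transcribed from Eq. (3); the function names are illustrative, and the exact form of the penalty should be checked against [4].

```python
# Expected improvement / degradation of a candidate with Q(x) ~ N(mu, sigma^2),
# and the expected-deviation criterion of Eq. (3). Sketch under the stated assumptions.
import math

def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, q_best):
    """E[max(0, Q(x) - Qbest)]."""
    if sigma <= 0.0:
        return max(0.0, mu - q_best)
    z = (mu - q_best) / sigma
    return (mu - q_best) * _cdf(z) + sigma * _pdf(z)

def expected_degradation(mu, sigma, q_best):
    """E[max(0, Qbest - Q(x))]."""
    if sigma <= 0.0:
        return max(0.0, q_best - mu)
    z = (q_best - mu) / sigma
    return (q_best - mu) * _cdf(z) + sigma * _pdf(z)

def expected_deviation(mu, sigma, q_best, q_max):
    """ED(x) as in Eq. (3): reward likely improvement, penalize possible degradation."""
    ed_plus = expected_improvement(mu, sigma, q_best)
    ed_minus = expected_degradation(mu, sigma, q_best)
    return ed_plus - (q_max - (q_best - ed_minus)) ** 2
```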

C. Learning From Demonstration

Informed exploration relies on the availability of prior experience to base decisions on. Therefore, reinforcement learning algorithms are often accelerated by combining them with learning from demonstrations (LfD) [19]. Furthermore, demonstrations are a convenient and familiar way of conveying knowledge for human teachers.

In this work, we apply learning from demonstrations to MAXQ by allowing our system to ask for human assistance for any subtask in the hierarchy. The human expert is then expected to demonstrate a solution for the particular subtask using the lower-level actions available to the subtask. The recorded action sequence is subsequently executed and rewards are collected for the visited state-action pairs. Once the subtask is completed, the completion values are updated in reverse order, from the terminal state to the state where the demonstration was originally requested. This way, whenever an update is computed, it benefits from the preceding update of the value of its successor. Since subtasks in the MAXQ hierarchy are context-free, and due to the generalization abilities of the GP approximation described in Sec. III-A, demonstrations naturally propagate to similar situations and to different tasks sharing the same subpolicy.
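A sketch of this procedure is given below. It relies on the same hypothetical interfaces as the previous sketches, plus an env.execute call that runs a child action and returns the successor state and reward, a terminates(s) predicate on the subtask, and the features encoding; all of these names are assumptions for illustration.

```python
# Incorporate a demonstrated action sequence for `task`, starting in state `s`:
# execute it, then update completion estimates in reverse order so that every
# update already benefits from the updated value of its successor state.

def learn_from_demonstration(task, s, demonstrated_actions, env):
    visited = []
    for a in demonstrated_actions:
        s_next, _reward = env.execute(task, a, s)    # run child action, collect reward
        visited.append((s, a, s_next))
        s = s_next
    for s_t, a_t, s_next in reversed(visited):       # terminal state first
        if task.terminates(s_next):
            target = 0.0                             # nothing left to complete
        else:
            target, _var = v_estimate(task, s_next)  # value of finishing task from s_next
        task.gp_completion.add_sample(features(s_t, a_t), target)
```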

D. Learning Method Selection

Learning from demonstrations and learning from experience gathered through interaction with the environment both offer unique strengths but also have weak points. Whereas demonstrations are an effective way of transferring existing expert knowledge and task constraints to robots without expressing them explicitly, they are often expensive to obtain. On the other hand, reinforcement learning allows adapting existing solutions to improve their effectiveness and to apply them under changing conditions. It can be highly ineffective, though, if no prior information is available. Combining both ways of finding task solutions in a flexible way is therefore a promising direction towards the development of more efficient learning methods. Accordingly, the system presented in this work integrates learning from imitation and reinforcement learning as complementary control flows. On all levels of the hierarchy, whenever an action needs to be taken, the system autonomously and independently decides for every subtask on the most appropriate way of finding a solution based on the situation at hand. This way, human involvement is focused on areas where autonomous learning cannot be performed safely, and at the same time the overall effort is reduced by allowing the system to learn from autonomous interaction where appropriate.

The individual decisions for learning from demonstrations or reinforcement learning are based on previously gathered experiences that are generalized by GPR. If there is enough knowledge to generate a solution that does not entail the risk of damaging the robot by executing actions with utterly unknown outcome, reinforcement learning is chosen. If the available knowledge is insufficient to propose a solution to the task, e.g., because a similar situation was never observed before or all evaluated solutions turned out to be poor, our system will ask for a demonstration of the (sub-)task.

To determine whether a suitable and safe action can be generated for the current situation scurr, we consider pairs (s, a) of previously taken actions that led to a positive reward and the situations they were taken in. Among them, we search for an optimal trade-off between value and similarity to the current situation scurr by optimizing

$$x = \operatorname*{argmin}_{(s,a) \in X} \; \alpha \left(Q_{max} - \mu_Q(s, a)\right) + \lVert s_{curr} - s \rVert. \tag{4}$$

Here, X ⊆ S × A is the set of positive training samples and α expresses the relative preference for high value versus similarity to the current situation. If optimization yields an action promising good performance while being sufficiently similar to previously executed actions, i.e., if its score undercuts a predefined threshold θ, it is taken as a starting point for reinforcement learning. If no such action is found, our system asks for assistance from a human expert. The threshold determines the carefulness of the system. If θ is set to a small value, the system only attempts actions on its own if they exhibit little risk and draws on human expert demonstrations otherwise. Given a large θ, the system leans on expert knowledge less often, which in turn entails an increased risk of failing a task.
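A sketch of this decision rule is shown below. It assumes a mu_q(s, a) predictor built from the estimates of Eq. (2), treats situations as numeric feature vectors, and uses illustrative names throughout.

```python
# Decide between autonomous exploration and requesting a demonstration,
# following Eq. (4) and the threshold theta described above. Sketch only.
import numpy as np

def select_learning_method(s_curr, positive_samples, mu_q, q_max, alpha, theta):
    """Return ('explore', (s, a)) or ('request_demonstration', None)."""
    best_score, best_pair = float('inf'), None
    for s, a in positive_samples:                    # X: rewarded (state, action) pairs
        score = (alpha * (q_max - mu_q(s, a))
                 + np.linalg.norm(np.asarray(s_curr) - np.asarray(s)))
        if score < best_score:
            best_score, best_pair = score, (s, a)
    if best_pair is not None and best_score < theta:
        return 'explore', best_pair                  # safe starting point for exploration
    return 'request_demonstration', None             # insufficient knowledge: ask the expert
```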

IV. EXPERIMENTS

To evaluate the performance of our approach, we choose a simplified version of a typical manipulation task from our everyday life that involves choosing among manipulation strategies based on the situation at hand. The task is to relocate a stack of small boxes on a table safely to a designated position. On this task, we conducted several experiments using the physics-based simulator Gazebo [20]. They demonstrate that the proposed system successfully and safely learns to solve the task from various situations.

A. Task Description

In our experiments, each episode consists of a stack of boxes that is placed in the robot's workspace and a designated target position where the stack—or parts of it—have to be placed. To indicate what part of the stack needs to be moved, the lowest affected box is highlighted. At the start of each episode, the real-valued coordinates of the initial and target positions in the robot's workspace, as well as the total number of boxes on the stack and the number of boxes to be displaced, are chosen at random. This allows the system to learn the task under various conditions. The task is complicated by a static obstacle placed at the center of the robot's workspace, preventing a displacement on a straight line between certain locations of the stack and some goal locations. On the other hand, actions are rewarded according to the work they consume—so the system benefits from pushing instead of lifting a box if possible. A second complication arises from the physical interactions of the stacked boxes modelled by our simulation, causing stacks to become unstable and collapse if more than two boxes are relocated at the same time. Here, the system has to learn to lift part of the stack off to a temporary location and to reassemble the partial stacks at the destination in order to limit the number of boxes that are moved at the same time. This strategy is depicted in Fig. 2.

B. Setup

In order to apply the proposed algorithm to this task, we define a MAXQ hierarchy with subtasks at three levels of abstraction (see Fig. 3).

1) Action spaces: On the topmost, most abstract level, there is only a single action that solves the entire task of moving a stack of boxes. To achieve this goal, the system has to learn to decide on a manipulation strategy and to combine intermediate-level actions accordingly.

Fig. 3. MAXQ graph [2] illustrating the sample task used in our experiments. MAX nodes represent subtasks and are depicted as triangles; Q nodes correspond to actions available to subtasks and are displayed as rectangles. Every Q node maintains the value of completing its parent task after executing its child action. (The graph links the top-level move stack task via the intermediate subtasks move box from table and move box from stack to the shared primitive actions grasp, push, move, and retract.)

On the intermediate level, there are two tasks for the top-level action to choose from, corresponding to the possible positions of a box in a stack: move a box that is located directly on the table and move a box that is located at a higher position in the stack. Every iteration involves moving a single box—including all boxes on top of it—to a designated target position. Sometimes, the whole top-level task may be completed with a single intermediate action that moves an entire stack. In other situations, several intermediate actions on partial stacks may need to be combined to solve the whole task. Besides learning to choose an appropriate action in every situation, the system has to select the source location of the box and a desired target location. Locations have to be chosen from a predefined set of reference points. Reference points represent task-specific locations independent of their coordinates in a particular instance of the task. They are predetermined by the designer of the MAXQ hierarchy. For our experiments, the initial location of the stack, the desired goal location of the stack, and the location of a temporary depository have been identified as relevant for the top-level task. For each of them, there are reference points at different heights corresponding to the positions within the stack.

In order to complete an intermediate task, there are four low-level motion primitives the system may combine: grasping an object, lifting an object to another position, pushing an object, and retracting the hand to a resting position. For any situation, the system has to choose the correct action as well as two reference point IDs determining the start and the goal of the desired movement. On the intermediate level, the set of available reference points includes the start and target locations determined by the action parameters from the top level and a predefined resting position the manipulator may assume between actions.


TABLE I
ACTION AND STATE REPRESENTATIONS IN OUR EXPERIMENTS

Action variables:
  top level:           action ∈ {move from stack, move from table}
                       start ∈ {initial location, goal, depository} × 3
                       goal ∈ {initial location, goal, depository} × 3
  intermediate level:  action ∈ {grasp, push, move, retract}
                       start ∈ {start, goal, resting position}
                       goal ∈ {start, goal, resting position}
  bottom level:        −

State variables:
  top level:           box location ∈ {initial location, goal, depository, absent} (per box)
                       selected box ∈ {0, . . . , number of boxes}
  intermediate level:  box location ⊂ R², goal ⊂ R²
                       holding object ∈ {0, 1}
                       current location ∈ {start, goal, resting position} × 3
  bottom level:        movement start ⊂ R², movement goal ⊂ R²

Actions on the lowest level of the MAXQ hierarchy correspond to movement primitives that are treated as atomic actions in our experiments. For the purpose of our experiments, the parameters of the movement primitives were trained beforehand using the approach described in our earlier work [21]. Given the ID of a movement primitive and the required reference point IDs, our system generates a trajectory using a controller that computes the effects of the movement and updates the simulated environment.

Tab. I summarizes the variables describing the action spaces on the different levels of the task hierarchy. Throughout our experiments, we assume that the effects of actions are deterministic.

2) State spaces: Using a MAXQ decomposition of our task and its value function allows us to benefit from state abstraction by factoring the state space in order to work with smaller subspaces on the different levels of the task hierarchy. This reduces the complexity of the individual subtasks and accelerates learning. On the top level, a state encompasses an ID for each box indicating whether the box is located at the random initial position of the stack, the random target location, the temporary depository, or outside the workspace. A further ID value designates the height of the box that should be displaced. On the intermediate level, the state space is reduced to contain only the information relevant for the subtask on this level, i.e., parameters referring to the box to be displaced. These are the current coordinates of the box to be displaced, the coordinates of the goal position, whether the robot is currently holding the object, and a reference point ID indicating the current position of the manipulator. Although we are not concerned with learning motion primitive parameters in this work, rewards received on the lowest level are still represented with a distinct GP model per motion primitive so that appropriate actions can be selected on higher levels. A situation on this level is defined by the start and goal coordinates of executed primitives. The state variables on all levels are summarized in Tab. I.

3) Rewards: In the proposed approach, rewards are assigned individually to subtasks on different levels of a hierarchical task. This allows us to reward or penalize specific subtasks without influencing parent tasks that may not have contributed to the subtask's performance, thereby avoiding what is referred to as the hierarchical credit assignment problem [2].

To assess the performance of a low-level task, we measure the amount of physical work involved in executing a motion primitive of length T for a given pair of start and goal parameters. It is defined as:

$$W = \sum_{t=1}^{T} \left| E_k(t) - E_k(t-1) \right|; \qquad E_k(t) = \frac{1}{2} m v^2(t), \tag{5}$$

where Ek is the kinetic energy, computed from the mass m and the velocity v at time t of the moved objects.
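As a small sketch, Eq. (5) can be computed directly from the mass of the moved objects and their speed at each time step of the executed primitive (the function name is illustrative):

```python
# Physical work of a motion primitive as in Eq. (5):
# W = sum_t |E_k(t) - E_k(t-1)| with E_k(t) = 0.5 * m * v(t)^2.
import numpy as np

def physical_work(mass, speeds):
    """`speeds` holds v(0), ..., v(T) of the moved objects."""
    e_kin = 0.5 * mass * np.square(np.asarray(speeds, dtype=float))
    return float(np.sum(np.abs(np.diff(e_kin))))
```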

On the intermediate level, we detect whether there was a collision between the manipulated stack and other elements of the world. Action sequences leading to a collision are penalized with a negative reward. If, instead, all unconcerned objects are still at their original positions and no collision is detected, a reward of zero is assigned. On the top level, we consider the degree to which the original task was completed to assign a reward. If the task was completed without altering the environment in unintended ways, a positive reward is assigned. If the environment was altered while executing the task, the chosen action is penalized with a negative reward. In all other cases, a neutral reward of zero is assigned and further iterations are taken until the task terminates. In summary, rewards are assigned according to

$$R_{top}(a, s) = \begin{cases} -1 & \text{terminated with side effects,} \\ 1 & \text{success,} \\ 0 & \text{else,} \end{cases}$$
$$R_{intermediate}(a, s) = \begin{cases} -1 & \text{terminated with collision or side effects,} \\ 0 & \text{else,} \end{cases}$$
$$R_{bottom}(a, s) = -W \quad \text{for every action.}$$
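For completeness, a direct sketch of this level-dependent reward assignment; the outcome flags and the work value are assumed to be reported by the simulation, and the names are illustrative:

```python
# Level-dependent rewards as defined above.

def reward(level, outcome):
    if level == 'top':
        if outcome.terminated_with_side_effects:
            return -1.0
        return 1.0 if outcome.success else 0.0
    if level == 'intermediate':
        return -1.0 if (outcome.collision or outcome.side_effects) else 0.0
    return -outcome.work        # bottom level: negative physical work, Eq. (5)
```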

Rather than approximating the Q function with a single GP using a fixed kernel width, we model each of the possible kinds of rewards with a separate GP and combine the predictions in a Gaussian mixture model. This allows us to assign different weights to the components. Throughout the experiments described in this section, we set the kernel widths to 0.2 on the top level and to 0.35 on the lower levels for the successful cases, and to 0.055 for the unsuccessful cases. Hence, unsuccessful cases have a more local influence. The threshold θ used to decide between reinforcement and imitation learning was set to θ = 0.3 on the top level and 0.25 on the lower levels, respectively.


V. RESULTS

To demonstrate how the proposed approach benefits from both the hierarchical structure of the task and the combination of reinforcement learning with learning from demonstrations, we ran 300 random episodes of the task.

Fig. 4 depicts the sequence of decisions for a learning method made by the system over the course of the random episodes. It clearly shows that the system successfully learned the task from as few as 18 demonstrations across all levels. On the top level, human assistance was requested in a total of six episodes, corresponding to each of the possible configurations of the stack. All requests for a demonstration were made during the first 16 episodes. Three of six were even made during the first three episodes of our experiment. This is in line with the expectation that the system does not perform random exploration if no prior knowledge is available and, on the other hand, no longer requests demonstrations if sufficient experiences have been gathered. Compared to approaches that require the teacher to bootstrap the learning process with a number of carefully selected training samples, the proposed system requests demonstrations on demand whenever an unknown situation is encountered. Bootstrapping therefore is not necessary in our approach, although it can be applied to incorporate available domain knowledge. The occasional late demonstrations are accounted for by the random generation of task parameters that sometimes produce novel task configurations late in the process. For instance, in episode 16 of our experiments, the task of displacing the whole stack with three boxes was drawn for the first time and led our system to request a demonstration. A similar effect can be observed on the intermediate level, where five of seven demonstrations of moving a box that is located directly on the table are requested during the first 27 episodes. Still, the final demonstration is not requested until episode 66, where a box has to be put on top of another box. Again, no similar configuration in this corner of the workspace had been observed before. The ability to incorporate demonstrations late in the process is advantageous in case the conditions of a task change and additional knowledge is needed to solve the task. In this respect, approaches that merely use demonstrations to obtain an initial policy for reinforcement learning are limited compared to our approach. On the intermediate level, the task of moving a box that is located on top of another box is simpler than the task of moving a box located directly on the table since, for the latter, the system needs to consider the obstacle on the table in its decision for a learning method. Consequently, only five demonstrations are required throughout the first 29 episodes in order to learn this task.

Looking at all three plots side by side reveals that the requests for demonstrations were often made at different episodes on each level, i.e., a lack of information to generate a safe solution on one level does not necessarily imply that subtasks on lower levels also cannot be solved without a demonstration, and vice versa.

Fig. 4. Choices made by our system during an experiment with 300 episodes of the full task with up to three boxes. Each bar depicts the choices made for a particular subtask and every episode is represented by a segment. The top bar represents decisions made for the top-level task; the lower bars represent the alternate tasks on the intermediate level. Consequently, each segment is marked in only one of them. Blue (dark) segments denote episodes where the system asked for a demonstration. Episodes where the system selected an action autonomously are marked in green (light).

Taking the decision for a learning method individually on each level therefore saves human effort for unnecessarily complex demonstrations. For instance, a demonstration is requested for moving the green box that is located on top of two other boxes. The demonstration in this case involves choosing an appropriate subtask and its parameters—in this case "move from stack". The system is then able to apply the primitive actions involved in this subtask autonomously, drawing on experiences from past episodes and different task configurations. The same is true for the strategy needed to displace an entire stack of three boxes, depicted in Fig. 2. Again, the strategy on the top level has to be demonstrated to the system, but the second intermediate-level action can be executed autonomously. This is facilitated by the subtask sharing and generalization abilities of our system, permitting reuse of acquired knowledge in different contexts, independent of the parent task. In the same way, both subtasks on the intermediate level draw on the same four primitive actions. There are also instances where a top-level task may be completed without assistance but a demonstration is needed to solve a subtask on a lower level. For example, in episode five, a demonstration is requested on the intermediate level because little information is available for that part of the continuous state space of that subtask, while the parent task is solved autonomously.

In summary, our experiments confirm that the system quickly learns to apply different strategies depending on the number of boxes in a stack and to split and reassemble large stacks to prevent them from collapsing. Similarly, it quickly apprehends that moving objects across the obstacle requires lifting them over, whereas pushing objects is the more energy-efficient solution in all other cases where the object is located directly on the table. While the system initially clings to the demonstrated solutions, it starts to apply solutions from different contexts as more samples become available and the certainty about the value of the demonstrated solutions increases.


Fig. 5. Randomly generated tasks where the system chose suboptimal actions for the "move box from table" subtask. The top graph depicts settings where a box was lifted to a destination although simply pushing it would have been the more efficient solution. Arrows in the bottom graph represent settings where the system caused a collision by trying to push the object.

It does not, however, attempt arbitrary actions with unknown outcome but instead recombines previously seen actions. For instance, objects are unnecessarily lifted in episodes 22, 24, and 51. In episodes 200 and 221, the system even tries to push a box although the obstacle obstructs the way, as depicted in Fig. 5. Collisions result in punishing rewards, and the corresponding actions are quickly abandoned in favor of the correct and energy-efficient solution. Collisions were only attempted once for every direction across the obstacle.

Throughout our experiments, the instances mentioned above were the only ones where the system did not correctly solve a subtask. On the top level, there was no failure at all, since the safety term in our exploration strategy prevented the system from choosing actions it had not observed before.

VI. CONCLUSION

In this paper, we presented a new system for hierarchical robot learning that combines reinforcement learning and learning from demonstrations to teach a robot complex tasks from human everyday life. Our method applies the well-known MAXQ method for hierarchical reinforcement learning to continuous state spaces by approximating completion values on compound levels and values of primitive states for all subtasks with GPs. This allows us to compute predictions of the value of unknown state-action pairs with an associated uncertainty. From these, we derive a Bayesian exploration criterion that safely optimizes the value of actions by trading off the expected improvement and degradation of GP estimates, thereby protecting the robot from actions with unpredictable outcome. To reduce the number of trials needed to find a good policy, we complement our reinforcement learning approach with learning from human expert demonstrations. Rather than limiting the use of demonstrations to the initialization of reinforcement learning, we integrate both ways of learning as alternate control flows in our system. For every situation and subtask, the system autonomously decides for either of them based on previously gathered experiences. This way, we efficiently exploit the complementary strengths of both approaches without inheriting their individual shortcomings. We evaluated our approach on a challenging box-stacking task involving a hierarchy of subtasks and different manipulation strategies that need to be chosen appropriately. Our results demonstrate that the proposed system is able to successfully learn the task and benefits from its hierarchical structure and from the independent decision for a learning method for every subtask. We would like to dedicate future work to the development of a coherent system with the ability to simultaneously acquire low-level motion primitives and complex task skills on a real robot.

REFERENCES

[1] B. Hengst, "Hierarchical Approaches," in Reinforcement Learning: State of the Art. Springer Berlin Heidelberg, 2012, ch. 9, pp. 293–323.

[2] T. G. Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition," Journal of Artificial Intelligence Research, vol. 13, pp. 227–303, 2000.

[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.

[4] K. Gräve, J. Stückler, and S. Behnke, "Improving Imitated Grasping Motions through Interactive Expected Deviation Learning," in Proc. IEEE-RAS Int. Conf. on Humanoid Robots, 2010, pp. 397–404.

[5] J. Kober, D. Bagnell, and J. Peters, "Reinforcement Learning in Robotics: A Survey," Int. Journal of Robotics Research, 2013.

[6] C. Daniel, G. Neumann, O. Kroemer, and J. Peters, "Learning Sequential Motor Tasks," in Proc. IEEE Int. Conf. on Robotics and Automation, 2013.

[7] J. Peters, K. Muelling, and Y. Altun, "Relative Entropy Policy Search," in Proc. Nat. Conf. on Artificial Intelligence, 2010.

[8] F. Stulp and S. Schaal, "Hierarchical Reinforcement Learning with Movement Primitives," in Proc. IEEE-RAS Int. Conf. on Humanoid Robots, 2011, pp. 231–238.

[9] E. Theodorou, J. Buchli, and S. Schaal, "A Generalized Path Integral Control Approach to Reinforcement Learning," Journal of Machine Learning Research, vol. 11, pp. 3137–3181, Dec. 2010.

[10] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, "Robot Learning from Demonstration by Constructing Skill Trees," Int. Journal of Robotics Research, vol. 31, no. 3, pp. 360–375, 2012.

[11] F. Cao and S. Ray, "Bayesian Hierarchical Reinforcement Learning," in Proc. Neural Information Processing Systems, 2012, pp. 73–81.

[12] N. K. Jong and P. Stone, "Compositional Models for Reinforcement Learning," in Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 644–659.

[13] R. I. Brafman and M. Tennenholtz, "R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning," The Journal of Machine Learning Research, vol. 3, pp. 213–231, 2003.

[14] G. J. Gordon, "Stable Function Approximation in Dynamic Programming," Carnegie Mellon University, Pittsburgh, PA, Tech. Rep., 1995.

[15] A. Bai, F. Wu, and X. Chen, "Online Planning for Large MDPs with MAXQ Decomposition," in Proc. Int. Conf. on Autonomous Agents and Multiagent Systems, vol. 3, Richland, SC, 2012, pp. 1215–1216.

[16] C. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[17] N. Vlassis, M. Ghavamzadeh, S. Mannor, and P. Poupart, "Bayesian Reinforcement Learning," in Reinforcement Learning. Springer Berlin Heidelberg, 2012, ch. 11, pp. 359–386.

[18] D. R. Jones, "A Taxonomy of Global Optimization Methods Based on Response Surfaces," Journal of Global Optimization, vol. 21, no. 4, pp. 345–383, 2001.

[19] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A Survey of Robot Learning from Demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[20] N. Koenig and A. Howard, "Design and Use Paradigms for Gazebo, An Open-Source Multi-Robot Simulator," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, vol. 3, 2004, pp. 2149–2154.

[21] K. Gräve and S. Behnke, "Incremental Action Recognition and Generalizing Motion Generation based on Goal-Directed Features," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2012, pp. 751–757.