
Model-based Reinforcement Learning with Neural Networks on Hierarchical Dynamic System

Akihiko Yamaguchi and Christopher G. Atkeson
Robotics Institute, Carnegie Mellon University, Pittsburgh PA

[email protected]

Abstract

This paper describes our strategy for approaching reinforcement learning in robotic domains, including the use of neural networks. We summarize our recent work on model-based reinforcement learning, where models of a hierarchical dynamic system are learned with stochastic neural networks [Yamaguchi and Atkeson, 2016b] and actions are planned with stochastic differential dynamic programming [Yamaguchi and Atkeson, 2015]. In particular, this paper clarifies why we believe our strategy works in complex robotic tasks such as pouring.

1 Introduction

The deep learning community is rapidly growing because of its successes in image classification [Krizhevsky et al., 2012], object detection [Girshick et al., 2014], pose estimation [Toshev and Szegedy, 2014], playing the game of Go [Silver et al., 2016], and so on. Researchers are interested in its applications to reinforcement learning (RL) and robot learning. Some attempts have been made [Mnih et al., 2015; Levine et al., 2015a; Pinto and Gupta, 2016; Yamaguchi and Atkeson, 2016b].

Although (deep learning) neural networks are a powerful tool, we do not think they solve the entire robot learning problem. Robot learning involves many different types of components; not only function approximation, but also what to represent (policies, value functions, forward models), learning schedules (e.g. [Asada et al., 1996]), and so on (cf. [Kober et al., 2013]). In this paper we describe our strategy for the robot learning problem, based on the lessons we learned from many practical robot learning case studies. We emphasize the importance of model-based RL (generalization ability, reusability, robustness to reward changes) and symbolic representations (reusability, numerical stability, ability to transfer knowledge from humans). As a consequence, we are exploring model-based RL with neural networks for hierarchical dynamic systems: learning models of the hierarchical dynamic system with stochastic neural networks [Yamaguchi and Atkeson, 2016b], and planning actions with stochastic differential dynamic programming [Yamaguchi and Atkeson, 2015].

Fig. 1: A state machine of the pouring behavior proposed in [Yamaguchi et al., 2015]. This is a symbolic representation of a policy, and there is a corresponding hierarchical dynamic system.

An example of a hierarchical dynamic system is illustrated in Fig. 1, which shows a state machine of the pouring behavior [Yamaguchi et al., 2015].

In the next section, we discuss our strategy. In the succeeding sections, we summarize our recent work [Yamaguchi and Atkeson, 2016b; Yamaguchi and Atkeson, 2015].

2 A Practical Reinforcement Learning Approach in Robotic Domains

In order to introduce our strategy, we first emphasize the importance of the model-based reinforcement learning (RL) approach and of symbolic representations.

2.1 Model-based RL

Although the currently popular approach to RL in robot learning is model-free (cf. [Kober et al., 2013]), there are many reasons to use model-based RL, in which dynamical models are learned from samples and actions are planned with dynamic programming.

Generalization ability and reusability There is some model-free RL work trying to increase generalization (e.g. [Kober et al., 2012; Levine et al., 2015b]). Model-based approaches can improve generalization by choosing adequate feature variables that capture the task variations. Additionally, model-based approaches can reuse learned models more aggressively. Consider a case where the dynamics of a robot change (e.g. we switch robots) but we still conduct the same manipulation task. The dynamics of the manipulated object do not change, so we can expect to reuse some dynamical models without additional learning.


For example, in a pouring task (cf. [Yamaguchi et al., 2015]), the dynamics of the poured material flow produced by tipping a container would be common between different robots, except for small effects caused by the robots themselves (e.g. a subtle vibration).

Another case of aggressive reuse is sharing models between different strategies. For example, in pouring there are different strategies to control flow, such as tipping and shaking the container. Sharing knowledge among strategies is possible at a detailed level with the model-based approach. For example, after material comes out of the source container and flow starts, the dynamical model mapping the flow state (flow position, flow variance) to the amount poured into the receiver or spilled onto the table can be shared between strategies. Such sharing and reuse would not be easy with a model-free approach.

Robustness to reward changes While learning a task, there are several factors that cause the reward to change, such as target changes (e.g. target positions when reaching with a robot arm, i.e. an inverse kinematics problem) and reward shaping.

In these situations, model-based methods would be more useful than model-free methods. Let us consider an inverse kinematics problem: we want to know the joint angles q of a robot for a given target position x∗ in Cartesian space. Learning an inverse model x∗ → q becomes complicated especially when: (1) the joints are redundant, i.e. dim(q) > dim(x∗), or (2) the opposite of (1), i.e. dim(q) < dim(x∗). In case (1), there are many (usually infinitely many) solutions q for a single x∗ (an ill-posed problem). Handling this issue is not trivial with a model-free approach that tries to learn the inverse kinematics directly. An example of (2) is the inverse kinematics of an android face (soft skin); q is the displacements of the actuators inside the robotic face, and x∗ is the positions of feature points (e.g. 17 points). Magtanong et al. proposed learning the forward kinematics with neural networks and solving the inverse problem with an optimization method [Magtanong et al., 2012]. This is an example of a model-based approach. They also tried to learn the inverse kinematics directly with neural networks, but the estimation becomes very inaccurate, especially when some elements of a target x∗ are out of the reachable range. Their model-based approach generates minimum-error solutions in such situations.
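To make this concrete, here is a minimal sketch of the model-based route: learn (or assume) a forward model and solve the inverse problem by optimization, in the spirit of [Magtanong et al., 2012]. The forward model below is a toy planar two-link arm standing in for a trained network, and the function names and optimizer choice are ours, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def forward_model(q):
    """Stand-in for a learned forward model x = F(q).

    Here: forward kinematics of a planar 2-link arm with unit link
    lengths, used only so the sketch runs end to end."""
    x = np.cos(q[0]) + np.cos(q[0] + q[1])
    y = np.sin(q[0]) + np.sin(q[0] + q[1])
    return np.array([x, y])

def solve_ik(x_target, q_init):
    """Minimize the squared task-space error over q.

    If x_target is unreachable, this still returns a minimum-error
    solution, which is the advantage noted in the text."""
    objective = lambda q: float(np.sum((forward_model(q) - x_target) ** 2))
    result = minimize(objective, q_init, method="Nelder-Mead")
    return result.x

q_star = solve_ik(np.array([1.2, 0.8]), q_init=np.array([0.1, 0.1]))
print(q_star, forward_model(q_star))
```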

Sometimes we make a goal easier to learn by modifying the reward function. This technique is called reward shaping (e.g. [Ng et al., 1999; Tenorio-Gonzalez et al., 2010]). The idea is also useful when a task involves multiple objectives or constraints; e.g. in a pouring task [Yamaguchi et al., 2015], the objective is pouring into a receiving container while avoiding spilling. Since reward changes do not affect the dynamical models, we can generate new optimal actions as long as the models are learned adequately.
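As a toy illustration only (not the reward used in [Yamaguchi et al., 2015]), a shaped reward for pouring might combine the competing objectives as weighted terms; since the dynamical models are untouched, the weights can be changed and actions re-planned without re-learning the dynamics.

```python
def shaped_reward(amount_poured, amount_spilled, amount_target,
                  w_pour=1.0, w_spill=10.0, w_target=5.0):
    """Toy shaped reward: reward pouring, penalize spilling and missing
    the target amount. Terms and weights are illustrative only."""
    return (w_pour * amount_poured
            - w_spill * amount_spilled
            - w_target * abs(amount_target - amount_poured))

print(shaped_reward(amount_poured=0.28, amount_spilled=0.02, amount_target=0.3))
```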

2.2 Symbolic Representations

Hierarchical Dynamic Systems The aggressive model reuse mentioned above works well when the entire dynamical system is decomposed into sub-dynamical systems. We refer to such a representation as a hierarchical dynamic system.

The hierarchical dynamic system is also powerful for dealing with simulation biases [Kober et al., 2013]. When forward models are inaccurate (which is usual when learning models), integrating forward models represented by differential equations causes a rapid increase of future state estimation errors (cf. [Atkeson and Schaal, 1997]). We instead model the mapping from input to output variables of each sub-dynamical system, so that we avoid the rapid error growth caused by integration. Following the task-level robot learning research [Aboaf et al., 1988; Aboaf et al., 1989], we can consider our representation a (sub)task-level dynamic system. We still need to estimate future states through the decomposed dynamics, which causes some error accumulation, but its rate is usually much reduced.
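The following toy comparison, with made-up dynamics and noise levels of our own, illustrates the point: an integrated time-step model injects a prediction error at every step, while a chain of subtask-level maps injects only one error per subtask over the same horizon.

```python
import numpy as np

rng = np.random.default_rng(0)

def integrate_timestep_model(x0, steps=100, dt=0.01, noise=0.01):
    """Euler-integrate a learned time-step model dx/dt ~ x; a small
    model error is injected at every integration step."""
    x = x0
    for _ in range(steps):
        x = x + dt * x + rng.normal(0.0, noise)
    return x

def compose_subtask_models(x0, n_subtasks=4, noise=0.01):
    """Chain subtask-level input->output maps covering the same total
    horizon (each subtask spans a quarter of it); one error per subtask."""
    x = x0
    for _ in range(n_subtasks):
        x = x * np.exp(0.25) + rng.normal(0.0, noise)
    return x

# Both approximate x0 * e over the same horizon; the time-step rollout
# accumulates ~100 error injections, the subtask rollout only 4.
print(integrate_timestep_model(1.0), compose_subtask_models(1.0), np.e)
```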

Symbolic Policy Representation Symbolic representation of policies is also useful. Especially in a complicated task like pouring [Yamaguchi et al., 2015], symbolic policies are essential. There are three major reasons: (A) Humans can more easily transfer their knowledge of symbolic policies to robots. (B) Symbolic policies are sharable across different domains. We implemented the pouring behavior for a PR2 robot and could easily apply it to a Baxter robot; modifying the symbolic policies was not necessary, although some low-level implementations were different, such as inverse kinematics and joint control [Yamaguchi and Atkeson, 2016a]. (C) A symbolic policy introduces structure into the task. Such a decomposition increases the reusability of sub-skills. For example, in [Yamaguchi et al., 2015] most sub-skills are shared between pouring and spreading. Pouring transfers material from a source container to a receiver, while spreading spreads material from a source container to cover the surface of a receiver (e.g. spreading butter on bread). Pouring and spreading share many sub-skills; the major difference is that spreading moves the mouth of the source container to cover the surface. The symbolic policy representation enabled us to reuse many sub-skills between pouring and spreading.

2.3 Model-based RL with Neural Networks on Hierarchical Dynamic System

Our strategy for RL in robotic domains is summarized as follows: (A) The robot represents high-level policies as symbolic policies, and learns them from humans. (B) The robot decomposes the task dynamics into symbols along the symbolic policies. (C) The robot learns each sub-dynamical system with numerical methods if the dynamical system is not already modeled. Basically, we model the mapping from input to output variables (a subtask-level dynamic system). If an analytical model is available, the robot uses it. (D) The robot plans actions by applying dynamic programming to the models.

For (D), we use a stochastic version of differential dynamic programming (DDP) to plan actions [Yamaguchi and Atkeson, 2015]. For (C), we extended neural networks to be usable with stochastic DDP [Yamaguchi and Atkeson, 2016b]. Stochastic modeling of the dynamics is also helpful in dealing with simulation biases.


Fig. 2: Neural network architecture [Yamaguchi and Atkeson, 2016b]. It has two networks with the same input vector. The top part estimates an output vector, and the bottom part models the prediction error and output noise. Both use ReLU activation functions.

3 Neural Networks for Regression with Probability Distributions

We extended neural networks to be capable of: (1) modeling prediction error and output noise, (2) computing an output probability distribution for a given input distribution, and (3) computing gradients of the output expectation with respect to an input. Since neural networks have nonlinear activation functions (in our case, rectified linear units; ReLU), these extensions were not trivial. [Yamaguchi and Atkeson, 2016b] gives an analytic solution for them with some simplifications.

We consider a neural network with rectified linear units (ReLU; f_relu(x) = max(0, x) for x ∈ R) as activation functions. Based on our preliminary experiments with neural networks for regression problems, using ReLU as the activation function was the most stable and obtained the best results. Fig. 2 shows our neural network architecture. For an input vector x, the neural network models an output vector y as y = F(x), defined as:

h_1 = f_relu(W_1 x + b_1),            (1)
h_2 = f_relu(W_2 h_1 + b_2),          (2)
...
h_n = f_relu(W_n h_{n-1} + b_n),      (3)
y = W_{n+1} h_n + b_{n+1},            (4)

where h_i is the vector of hidden units at the i-th layer, W_i and b_i are the parameters of a linear model, and f_relu is the element-wise ReLU function.
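The following is a minimal numpy sketch of the forward pass in Eqs. (1)-(4); the weights are random placeholders standing in for trained parameters, and the layer sizes follow the three hidden layers of 200 units used in the experiments below.

```python
import numpy as np

def f_relu(p):
    # Element-wise ReLU used in Eqs. (1)-(3).
    return np.maximum(0.0, p)

def forward(x, weights, biases):
    """Forward pass of F(x): ReLU hidden layers, linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = f_relu(W @ h + b)              # Eqs. (1)-(3)
    return weights[-1] @ h + biases[-1]    # Eq. (4)

rng = np.random.default_rng(0)
dims = [5, 200, 200, 200, 3]   # input, three hidden layers of 200 units, output
weights = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i]))
           for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

y = forward(rng.normal(size=dims[0]), weights, biases)
```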

Even when an input x is deterministic, the output y might have error due to: (A) prediction error, i.e. the error between F(x) and the true function, caused by an insufficient number of training samples, insufficient training iterations, or insufficient modeling ability (e.g. a small number of hidden units), and (B) noise added to the output. We do not distinguish (A) and (B). Instead, we consider additive noise y = F(x) + ξ(x), where ξ(x) is zero-mean normal noise N(0, Q(x)). Note that since the prediction error changes with x and the output noise might change with x, Q is a function of x.

Fig. 3: A system with N decomposed processes. The dotted box denotes the whole process.

In order to model Q(x), we use another neural network, ∆y = ∆F(x), whose architecture is similar to that of F(x), and approximate Q(x) = diag(∆F(x))^2. Here ∆y is the element-wise absolute error between the training data y and the prediction F(x).

When x follows a normal distribution N(µ, Σ), our purpose is to approximate E[y] and cov[y]. The difficulty is the nonlinear ReLU operator f_relu(p). We use the approximation that when p is normally distributed, f_relu(p) is also normally distributed. We also assume that this computation is done element-wise, that is, we ignore the non-diagonal elements of cov[p] and cov[f_relu(p)]. Although the covariance Q(x) depends on x, we use its MAP estimate Q(µ).
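One way to realize this element-wise approximation, sketched below, is with the standard closed-form moments of a rectified Gaussian; the exact simplifications used in [Yamaguchi and Atkeson, 2016b] may differ. At the output layer, the MAP estimate Q(µ) from ∆F would simply be added to the diagonal output covariance.

```python
import numpy as np
from scipy.stats import norm

def relu_gaussian_moments(mu, var):
    """Element-wise mean and variance of max(0, p) for p ~ N(mu, var),
    using the standard rectified-Gaussian moments."""
    s = np.sqrt(np.maximum(var, 1e-12))
    z = mu / s
    ex = mu * norm.cdf(z) + s * norm.pdf(z)                     # E[relu(p)]
    ex2 = (mu**2 + var) * norm.cdf(z) + mu * s * norm.pdf(z)    # E[relu(p)^2]
    return ex, np.maximum(ex2 - ex**2, 0.0)

def propagate_layer(mu, var_diag, W, b):
    """Propagate a diagonal Gaussian through one ReLU layer; off-diagonal
    covariance terms are ignored, as in the element-wise treatment above."""
    mu_p = W @ mu + b
    var_p = (W**2) @ var_diag        # diag(W diag(var) W^T)
    return relu_gaussian_moments(mu_p, var_p)
```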

4 Stochastic Differential Dynamic Programming

We use the stochastic differential dynamic programming (DDP) proposed in [Yamaguchi and Atkeson, 2015]. Stochastic means that we consider probability distributions of states and expectations of rewards. This design of the evaluation function is similar to recent DDP methods (e.g. [Pan and Theodorou, 2014]). On the other hand, we use a simple gradient descent method to optimize actions, while traditional DDP [Mayne, 1966] and recent methods (e.g. [Tassa and Todorov, 2011; Levine and Koltun, 2013; Pan and Theodorou, 2014]) use a second-order algorithm like Newton's method. In terms of convergence speed, our DDP is inferior to second-order DDP algorithms, but the quality of the solution would be the same as long as we use the same evaluation function, and our algorithm is easier to implement. Since we use learned dynamics models, there should be more local optima than when using analytically simplified models. In order to avoid poor local optima, our DDP uses practical measures such as multiple starting points.

Fig. 3 shows a system with N decomposed processes. The input of the n-th process is a state x_n and an action a_n, and its output is the next state x_{n+1}. Note that some actions may be omitted. A reward is given to the outcome state. Let F_n denote the process model, x_{n+1} = F_n(x_n, a_n), and R_n denote the reward model, r_n = R_n(x_n). We assume that every state is observable.

Each dynamics model F_n is learned from samples with the neural networks, where the prediction error is also modeled. At each step n, a state x_n is observed, and then the actions {a_n, ..., a_{N-1}} are planned with DDP so that an evaluation function J_n(x_n, {a_n, ..., a_{N-1}}) is maximized. J_n is the expected sum of future rewards, defined as follows:

J_n(x_n, {a_n, ..., a_{N-1}}) = E[ Σ_{n'=n+1}^{N} R_{n'}(x_{n'}) ]
                              = Σ_{n'=n+1}^{N} E[ R_{n'}(x_{n'}) ].      (5)

In the DDP algorithm, we first generate an initial set of actions. Using this initial guess, we predict the probability distributions of future states with the neural networks, and the expected rewards are computed accordingly. Then we update the set of actions iteratively with a gradient descent method. In these calculations, we use our extensions of neural networks to compute the expectations, covariances, and gradients.
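A schematic of this planning loop, in our own simplified form, is sketched below: roll the learned models forward, sum the expected rewards, and improve the actions by gradient ascent with multiple random restarts. The model and reward functions are placeholders, and the gradients here are numerical, whereas the actual method computes gradients of the output expectations analytically through the networks.

```python
import numpy as np

def expected_return(x1, actions, models, rewards):
    """Roll the learned models forward and sum expected rewards, as in Eq. (5).

    models[n](x_mean, x_var, a) -> (mean, var) of the next state;
    rewards[n](mean, var) -> expected reward of that state."""
    x_mean, x_var = x1, np.zeros_like(x1)
    J = 0.0
    for F, R, a in zip(models, rewards, actions):
        x_mean, x_var = F(x_mean, x_var, a)
        J += R(x_mean, x_var)
    return J

def plan(x1, models, rewards, action_dims, n_restarts=5, n_iters=100,
         step=0.05, eps=1e-3, rng=np.random.default_rng(0)):
    """Gradient-ascent planner with multiple starting points to avoid
    poor local optima; gradients are numerical for simplicity."""
    best_J, best_actions = -np.inf, None
    for _ in range(n_restarts):
        actions = [rng.normal(size=d) for d in action_dims]
        for _ in range(n_iters):
            for a in actions:
                grad = np.zeros_like(a)
                for i in range(a.size):
                    a[i] += eps
                    J_plus = expected_return(x1, actions, models, rewards)
                    a[i] -= 2.0 * eps
                    J_minus = expected_return(x1, actions, models, rewards)
                    a[i] += eps
                    grad[i] = (J_plus - J_minus) / (2.0 * eps)
                a += step * grad
        J = expected_return(x1, actions, models, rewards)
        if J > best_J:
            best_J = J
            best_actions = [a.copy() for a in actions]
    return best_actions, best_J
```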

5 Experiments

This section summarizes the verification of our method described in [Yamaguchi and Atkeson, 2016b]. First we conduct simulation experiments. The problem setup is the same as in [Yamaguchi and Atkeson, 2015]. In the Open Dynamics Engine¹, we simulate source and receiving containers, poured material, and a robot gripper grasping the source container. The materials are modeled with many spheres, simulating the complicated behavior of materials such as tomato sauce during shaking.

We use the same state machines for pouring as those in [Yamaguchi and Atkeson, 2015], which are a simplified version of [Yamaguchi et al., 2015]. These state machines have some parameters: a grasping height and a pouring position. These parameters are planned with DDP.

There are four separate processes: (1) grasping, (2) moving the source container to the pouring location (close to the receiving container), (3) flow control, and (4) flow to poured amounts. The dimensions of the states are (2, 3, 5, 4, 2). There is one grasping parameter and there are two pouring location parameters.

We investigated on-line learning, and compared the neural network models with locally weighted regression. After each state observation, we train the neural networks and plan actions with the models. In the early stage of learning, we gather three samples with random actions, and after that we use DDP. Each neural network has three hidden layers; each hidden layer has 200 units. The learning performance with the neural networks outperformed that with locally weighted regression, and the amount of spilled material was reduced.

We also tested alternative state vectors that have redundant and/or non-informative elements, expecting automatic feature extraction with (deep) neural networks. For example, we replaced the position of a container by the four corner points of its mouth (x, y, z for each). Consequently the dimensions of the state vectors changed to (12, 13, 16, 4, 2). Although the total state dimensionality changed from 16 to 47, the learning performance did not change.

The early results of the robot experiments using a PR2 were also positive. A video describing the simulation experiments and the robot experiments is available: https://youtu.be/aM3hE1J5W98

¹ http://www.ode.org/

References

[Aboaf et al., 1988] E. W. Aboaf, C. G. Atkeson, and D. J. Reinkensmeyer. Task-level robot learning. In the IEEE International Conference on Robotics and Automation, pages 1309–1310, 1988.

[Aboaf et al., 1989] E. W. Aboaf, S. M. Drucker, and C. G. Atkeson. Task-level robot learning: juggling a tennis ball more accurately. In the IEEE International Conference on Robotics and Automation, pages 1290–1295, 1989.

[Asada et al., 1996] Minoru Asada, Shoichi Noda, Sukoya Tawaratsumida, and Koh Hosoda. Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Machine Learning, 23(2-3):279–303, 1996.

[Atkeson and Schaal, 1997] Christopher G. Atkeson and Stefan Schaal. Robot learning from demonstration. In the International Conference on Machine Learning, 1997.

[Girshick et al., 2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[Kober et al., 2012] Jens Kober, Andreas Wilhelm, Erhan Oztop, and Jan Peters. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, 33:361–379, 2012.

[Kober et al., 2013] J. Kober, J. Andrew Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. International Journal of Robotics Research, 32(11):1238–1274, 2013.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[Levine and Koltun, 2013] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems 26, pages 207–215. Curran Associates, Inc., 2013.

[Levine et al., 2015a] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. ArXiv e-prints, (arXiv:1504.00702), April 2015.

[Levine et al., 2015b] Sergey Levine, Nolan Wagener, and Pieter Abbeel. Learning contact-rich manipulation skills with guided policy search. In the IEEE International Conference on Robotics and Automation (ICRA'15), 2015.

[Magtanong et al., 2012] Emarc Magtanong, Akihiko Yamaguchi, Kentaro Takemura, Jun Takamatsu, and Tsukasa Ogasawara. Inverse kinematics solver for android faces with elastic skin. In Latest Advances in Robot Kinematics, pages 181–188, Innsbruck, Austria, 2012.

[Mayne, 1966] David Mayne. A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control, 3(1):85–95, 1966.


[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[Ng et al., 1999] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In the Sixteenth International Conference on Machine Learning, pages 278–287, 1999.

[Pan and Theodorou, 2014] Yunpeng Pan and Evangelos Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems 27, pages 1907–1915. Curran Associates, Inc., 2014.

[Pinto and Gupta, 2016] Lerrel Joseph Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In the IEEE International Conference on Robotics and Automation (ICRA'16), 2016.

[Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[Tassa and Todorov, 2011] Y. Tassa and E. Todorov. High-order local dynamic programming. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011 IEEE Symposium on, pages 70–75, 2011.

[Tenorio-Gonzalez et al., 2010] Ana Tenorio-Gonzalez, Eduardo Morales, and Luis Villasenor Pineda. Dynamic reward shaping: Training a robot by voice. In Advances in Artificial Intelligence – IBERAMIA 2010, volume 6433 of Lecture Notes in Computer Science, pages 483–492, 2010.

[Toshev and Szegedy, 2014] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1653–1660, 2014.

[Yamaguchi and Atkeson, 2015] Akihiko Yamaguchi and Christopher G. Atkeson. Differential dynamic programming with temporally decomposed dynamics. In the 15th IEEE-RAS International Conference on Humanoid Robots (Humanoids'15), 2015.

[Yamaguchi and Atkeson, 2016a] Akihiko Yamaguchi and Christopher G. Atkeson. First pouring with Baxter. https://youtu.be/NIn-mCZ-h_g, 2016. [Online; accessed April-15-2016].

[Yamaguchi and Atkeson, 2016b] Akihiko Yamaguchi and Christopher G. Atkeson. Neural networks and differential dynamic programming for reinforcement learning problems. In the IEEE International Conference on Robotics and Automation (ICRA'16), 2016.

[Yamaguchi et al., 2015] Akihiko Yamaguchi, Christopher G. Atkeson, and Tsukasa Ogasawara. Pouring skills with planning and learning modeled from human demonstrations. International Journal of Humanoid Robotics, 12(3):1550030, 2015.