
A MODULAR REINFORCEMENT LEARNING METHOD FOR

ADAPTABLE ROBOTIC ARMS

A Thesis

Presented By

Qiliang Chen

to

The Department of Industrial Engineering

in partial fulfillment of the requirements for the degree of

Master of Science

in the field of

Data Analytics Engineering

Northeastern University

Boston, Massachusetts

May 2020


ABSTRACT

The vision of Industry 4.0 is to materialize the notion of a lot size of one through enhanced adaptability of manufacturing and logistics operations to dynamic changes or deviations on shop floors. Currently, almost all industrial robots can only perform rote and repetitive tasks in highly structured environments. Recent advances in meta reinforcement learning and multi-task learning have the potential to enable robots to adapt to a series of highly interrelated tasks by leveraging prior knowledge, which increases sample efficiency. However, these methods assume a high degree of similarity within the task set, an assumption that is difficult to satisfy in real-life problems. Motivated by this gap, this thesis develops a modular reinforcement learning framework to enhance the efficient transfer of control policies from previously learned tasks. The proposed framework contains three steps: modularization, modular reinforcement learning, and modular composition. Experiments on the OpenAI Gym Robotics environments Reach, Push, and Pick-and-Place indicate an average 75% reduction in the number of iterations needed to achieve a 60% success rate, compared to the deep deterministic policy gradient (DDPG) algorithm as a baseline. The significant improvements in the jumpstart and asymptotic performance of the agent create a promising opportunity to address the current limitations of industrial robots associated with sample inefficiency and narrow task range through task modularization and transfer learning.


ACKNOWLEDGMENTS

Foremost, I would like to express my sincere gratitude to my advisor, Prof. Mohsen Moghaddam, for his continuous support of my Master’s study and research, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Master’s study.

Besides my advisor, I would like to thank the rest of my thesis committee, Prof. Babak Heydari, for his encouragement, insightful comments, and hard questions.

Last but not least, I would like to thank my family: my parents, Weixin Chen and Xiangmei Zeng, for giving birth to me in the first place and supporting me spiritually throughout my life.


TABLE OF CONTENTS

1. ABSTRACT
2. Introduction
    2.1 Motivation
    2.2 Problem statement and framework overview
3. Related work
    3.1 Reinforcement Learning
    3.2 Meta Learning
    3.3 Multi-task Learning
    3.4 Modularity
4. Framework
    4.1 Modularization and module composition
    4.2 Modular DDPG Reinforcement learning
5. Experiment and results
    5.1 Experiment design
    5.2 Experiment results
    5.3 Discussion
6. Conclusion
7. REFERENCE


LIST OF FIGURES

Figure 1: Overview of the proposed framework
Figure 2: Algorithm 1-- Modular-DDPG
Figure 3: Experiments setup and task modularization on the OpenAI Gym environments: (a) Reach a randomly chosen position in 3D space. (b) Push an object to a randomly chosen position in 2D space (tabletop). (c) Pick-and-Place to a randomly chosen position in 3D space.
Figure 4: Preliminary results on Scenarios 1
Figure 5: Preliminary results on Scenarios 2


2. Introduction

2.1 Motivation

Smart manufacturing is an emerging form of production that integrates the manufacturing assets of today and tomorrow with sensors, computing platforms, communication technology, control, simulation, data-intensive modelling, and predictive engineering (Kusiak, 2018). One of its most important components is the advanced robot, owing to its versatility and extensive use in a wide range of applications such as assembly, welding, painting, packaging, labeling, and inspection. Inspired by Industry 4.0’s goal of building smart factories (Monostori et al., 2016) (Lasi et al., 2014) (Hofmann and Rüsch, 2017) (Lu, 2017) (Zhong et al., 2017), robots also need the ability to solve problems and make decisions independently of people, and the flexibility to learn new skills quickly and efficiently (Wang et al., 2019). However, almost all real-world industrial robots in manufacturing plants can currently only perform repetitive tasks in strictly structured environments, which does not meet the requirements of real-life problems. For example, piece picking requires robots to pick items and place them at specified target locations in a given order, in an unstructured environment.

The rapid development of reinforcement learning (RL) helps robots learn skills with little human intervention. For example, an industrial robot equipped with a visual sensor can learn through RL how to insert a peg into a hole or pick an object and place it at a target location. The problem is that RL typically teaches a robot only a single skill: when the robot faces a very similar new task, such as a change in object shape, rotation, or background, the previously learned policy performs poorly. Humans, in contrast, have a strong ability to use prior knowledge to adapt to a similar new task. For example, people who have learned to ride a bicycle can learn to ride a motorcycle more easily and quickly. This gap in how robots transfer prior knowledge to master a similar new task has attracted the interest of many researchers and institutions, such as DeepMind (J. X. Wang et al., 2016) (Gupta et al., 2018) (Ritter et al., 2018) (Botvinick et al., 2019), OpenAI, and Siemens (Levine et al., 2016) (Duan et al., 2016) (Tamar et al., 2016) (Vinyals et al., 2019). Research on improving the adaptability of robots is therefore meaningful and is a key step toward the smart factory.

Compared to humans, deep reinforcement learning (DRL) still has deficiencies in two respects (J. X. Wang et al., 2016). First, DRL is sample-inefficient, whereas humans can attain reasonable performance on a wide range of tasks with comparatively little experience. Second, DRL can only specialize in a narrow range of tasks, whereas humans can flexibly adapt to changing tasks. Much recent research focuses on the first challenge. Meta-learning is an emerging topic in artificial intelligence. Its key idea is learning to learn (J. X. Wang et al., 2016): the algorithm aims to extract, from experience across a set of interrelated tasks, a method for learning new tasks, so that it can learn a new task within a few epochs and from few examples. Meta-learning enhanced with episodic memory can refer to stored past experience to make faster and better estimates of state values or decisions when facing a similar state. Multi-task learning (MTL) aims to let the agent learn several skills together, leveraging useful information contained in multiple related but not identical tasks to improve the generalization performance on all the tasks (Zhang and Yang, 2017). An extensive review of reinforcement learning, meta-learning with episodic memory, and multi-task learning is provided in Section 3.

Although meta-learning shows promising results on the first challenge of RL (sample inefficiency), it makes a strict assumption about the degree of similarity within the task set. As pointed out by Vinyals et al. (2019), current Meta-RL methods are still limited to very narrow task distributions that allow only slight parametric differences between individual tasks. Humans, by contrast, can still transfer some knowledge from previous experience when facing unfamiliar tasks. For example, people who often play poker can learn a new poker game faster than those with no experience, because they can transfer knowledge about the cards themselves, such as the meaning of each card or the skill of estimating opponents’ cards, so that what they need to learn is mainly the new rules. Inspired by this phenomenon, we take a high-level combinatorial generalization approach based on the notion of modularity (Devin et al., 2017) (Alet, Lozano-Pérez and Kaelbling, 2018) to address the second limitation of deep RL. As argued by Herbert Simon (Simon, 1991), modularity is the ubiquitous adaptability mechanism through which most biological, social, economic, and software systems manage complexity in highly unstructured environments (Nolfi, 1997) (Baldwin and Clark, 2000) (Sullivan et al., 2001) (Gianetto and Heydari, 2015) (Heydari and Dalili, 2015). The implication of modular design in the context of deep RL is simple: although learning a single policy that performs well across all tasks is not optimal (Devin et al., 2017), learning and mixing-and-matching simpler policies for sufficiently small and overlapping task modules may solve the second aforementioned problem of current deep RL methods (i.e., specialization on one task).

2.2 Problem Statement and Framework Overview

In this thesis, we propose a modular RL framework, based on the notion of task modularity and on transfer learning of knowledge between similar modules, to solve diverse robotic tasks in manufacturing and logistics. Our goal is to tackle the following fundamental research question: How can a robot autonomously adapt to complex and unstructured manufacturing and logistics environments (e.g., assembly or piece-picking) through automated transfer of its learned knowledge across parametrically and non-parametrically different tasks, using the notion of modularity? The framework comprises three stages (see Figure 1):

1) Modularization: Deciding on the degree and architecture of modularity is one of the key decisions to make, governed by a range of parameters that characterize the degree of spatial variations (e.g., distance between different tasks or environments) as well as temporal uncertainties (e.g., unexpected obstacles on the way, or dynamic environments) (Heydari, Mosleh and Dalili, 2016). The objective of this stage is to decompose tasks into reusable modules for combinatorial generalization to a wide range of new tasks.

2) Modular training: The goal of this stage is to train a separate function to represent each module produced by the first stage, which means that several independent neural networks need to be created. During training, if a similar pre-trained module is detected, we transfer its parameters to the new module; otherwise, we initialize the new module randomly and train it from scratch with an RL algorithm. The network parameters of each new module are recorded for future use. The idea is to reuse the pre-trained modules efficiently to speed up the training of new modules in the future (Alet, Lozano-Pérez and Kaelbling, 2018).

3) Composition: This stage is built on the assumption that there exists a compositional scheme (Alet, Lozano-Pérez and Kaelbling, 2018) that enables any new task to be formed from a set of modules. Thus, the goal of this stage is to learn the relationships between the modules and, based on them, to connect all the modules together so that they can solve the whole task.
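The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the names `ModuleRepository`, `word_overlap`, and `compose`, the similarity threshold, and the use of flat parameter lists in place of neural networks are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional

Params = List[float]  # stand-in for a module's network weights (assumption)


def word_overlap(a: str, b: str) -> float:
    """Hypothetical similarity index: Jaccard overlap of name tokens."""
    sa, sb = set(a.split("_")), set(b.split("_"))
    return len(sa & sb) / len(sa | sb)


class ModuleRepository:
    """Stores trained module parameters keyed by module name."""

    def __init__(self, similarity: Callable[[str, str], float], threshold: float):
        self._store: Dict[str, Params] = {}
        self._similarity = similarity
        self._threshold = threshold

    def most_similar(self, name: str) -> Optional[str]:
        """Return the stored module most similar to `name`, if above threshold."""
        best, best_score = None, self._threshold
        for candidate in self._store:
            score = self._similarity(name, candidate)
            if score >= best_score:
                best, best_score = candidate, score
        return best

    def init_params(self, name: str, size: int, rng: random.Random) -> Params:
        """Stage 2: transfer from a similar module, else initialize randomly."""
        source = self.most_similar(name)
        if source is not None:
            return list(self._store[source])  # copy so the source stays intact
        return [rng.uniform(-0.1, 0.1) for _ in range(size)]

    def record(self, name: str, params: Params) -> None:
        """Save trained parameters for future reuse."""
        self._store[name] = list(params)

    def get(self, name: str) -> Params:
        return self._store[name]


def compose(task: List[str], repo: ModuleRepository) -> List[Params]:
    """Stage 3: chain the stored modules in the order the task requires."""
    return [repo.get(module) for module in task]
```

In this sketch a task is simply an ordered list of module names produced by the modularization stage; training of each module between `init_params` and `record` is left out.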

The proposed framework has been validated on the Robotics environments of

OpenAI Gym (Brockman et al., 2016), with deep deterministic policy gradient (DDPG)

(Lillicrap et al., 2019) serving as the deep RL model. Section 4 provides the details of the

framework, and Section 5 presents the experiments and results. Section 6 concludes the

thesis along with a summary of limitations and proposed directions for future research.

[Figure 1: Overview of the proposed framework. Flowchart blocks: Expected New Tasks; Task Description & Similarity Index; Environment Uncertainty Characterization; Task Modularization; Module Mapping & Transfer Learning; Module Pre-trained? (Yes: Transfer; No: Random Initialization); Modular Reinforcement Learning; Task Module Repository (Update); Validation on OpenAI Gym Robotics Environments.]

3. Related work

3.1 Reinforcement Learning

Recently, reinforcement learning (RL) has achieved remarkable success in several different areas. Deep reinforcement learning (DRL) first attracted wide attention when a DRL agent learned to play Atari games at a super-human level (Mnih et al., 2015). A series of significant results followed. Go was long regarded as one of the most challenging games for artificial intelligence. In 2016, the AlphaGo agent developed by DeepMind defeated the legendary Go player Lee Sedol 4-1 using DRL (Silver et al., 2016), and the updated agent AlphaGo Zero achieved superhuman performance, winning 100–0 against the champion-defeating AlphaGo (Silver et al., 2017). StarCraft II is a real-time strategy video game with a highly complex environment and a diverse set of possible decisions. In 2019, the AlphaStar agent from DeepMind used multi-agent reinforcement learning to master this game and was rated above 99.8% of officially ranked human players (Vinyals et al., 2019). García and Shafie (2020) used a safe RL algorithm to teach a humanoid robot to walk faster.

3.1.1 Value-based and Policy-based Reinforcement Learning

Reinforcement learning algorithms fall into two types, model-free and model-based. Model-based algorithms first learn a representation of the environment and then plan a solution in it, whereas model-free algorithms do not. Model-free RL algorithms have been researched extensively in recent years. One of the most basic yet popular and powerful value-based model-free algorithms is Q-learning (Mnih et al., 2015). It estimates the Q-value function, which represents how good it is to take an action in a given state, and the policy chooses the action that maximizes the Q value. However, the original Q-learning algorithm tends to overestimate Q values, which degrades performance. Instead of using the same values to select and to evaluate an action, Double Q-learning decouples selection from evaluation by using two Q-value functions (Van Hasselt, Guez and Silver, 2016). Dueling DQN splits the basic Q-network architecture into two streams, a state-value function and per-action advantages, and combines them to obtain the final Q-value function, which better separates how much of the value comes from the state and how much from the action (Z. Wang et al., 2016). Prioritized experience replay increases the probability of sampling those transitions from the replay buffer that are expected to change the parameters the most (Schaul et al., 2016). It is a general strategy that can be applied to different RL algorithms. There are also policy-based RL methods, which update the policy along the gradient of the expected reward with respect to the policy parameters (Sutton et al., 2000). The value-based methods above are critic-only, and the policy-based methods are actor-only.


The actor-critic architecture aims to combine the strengths of both methods: the critic approximates a value function, which is then used to update the actor’s policy parameters to improve performance (Konda and Tsitsiklis, 2000). However, one of the biggest drawbacks of the original actor-critic method is high variance. Learning an advantage function (the value function minus a baseline value) reduces it, which is known as the advantage actor-critic method (Mnih et al., 2016). To use the data collected from interaction with the environment more efficiently, importance sampling can reuse past experience to update the policy (Degris, White and Sutton, 2012). The proximal policy optimization (PPO) algorithm constrains the difference between the old policy and the new policy when using importance sampling to learn from previous experience, which makes the learning process much more stable (Schulman et al., 2017).
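As a concrete illustration of the value-based updates discussed in this subsection, the sketch below contrasts the standard Q-learning target with the Double Q-learning target in a tabular setting. The table layout and the function names are assumptions made for illustration only.

```python
from typing import Dict, List

QTable = Dict[str, List[float]]  # state -> list of action values (assumed layout)


def q_learning_target(q: QTable, reward: float, next_state: str, gamma: float) -> float:
    """Standard target: the same table both selects and evaluates the next action."""
    return reward + gamma * max(q[next_state])


def double_q_target(q_select: QTable, q_eval: QTable, reward: float,
                    next_state: str, gamma: float) -> float:
    """Double Q-learning target: one table selects the action, the other evaluates it."""
    values = q_select[next_state]
    best_action = max(range(len(values)), key=values.__getitem__)
    return reward + gamma * q_eval[next_state][best_action]
```

When the selecting table overestimates an action, the evaluating table is unlikely to make the same error for the same action, which is why the second target tends to be less optimistic.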

3.1.2 Continuous Action Space Reinforcement Learning

In the problem settings above, the action space is discrete. In Atari games, for example, the actions may be “up”, “down”, “left”, “right”, and “fire”, so the output of the policy is a probability distribution over actions given a state, and under the optimal policy we simply choose the action with the largest probability. However, many real tasks have continuous action spaces. In piece picking, for instance, the action is the angle or velocity of the joints of the robotic arm. Deterministic Policy Gradient (DPG) uses an actor to map states to deterministic actions rather than to a probability distribution, so it can be applied to tasks with continuous action spaces (Silver et al., 2014). Deep Deterministic Policy Gradient (DDPG) is trained off-policy with samples from a replay buffer. It also uses a target Q network to provide consistent targets during temporal-difference backups (Lillicrap et al., 2019).
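The role of the target network can be sketched as follows. This is a simplified illustration, assuming that network weights are flat lists of floats and that the actor and critic are plain callables. The function names and the value of `tau` are hypothetical. The slowly moving ("soft") target update follows the general scheme described for DDPG.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Action = Sequence[float]


def td_target(reward: float, next_state: State, done: bool, gamma: float,
              target_actor: Callable[[State], Action],
              target_critic: Callable[[State, Action], float]) -> float:
    """Temporal-difference target computed with the target actor and target critic."""
    if done:
        return reward  # no bootstrapping past a terminal state
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Move target weights a small step toward the online weights."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]
```

Because the target weights change slowly, the regression target for the critic stays nearly fixed between updates, which is what makes the temporal-difference backups consistent.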

In our case, the reward signals are sparse and binary, which means the agent may keep interacting with the environment without receiving any positive reward and thus learn nothing. A common remedy is reward shaping: using human prior knowledge, the agent is guided toward the final goal step by step. For example, the agent developed by OpenAI learned to play Dota 2 with guidance such as a negative reward when its character dies or a positive reward when it collects resources, while the final goal is winning the game (OpenAI et al., 2019). In some tasks, however, it is very hard to design a good reward shaping strategy. Hindsight Experience Replay (HER) can be described as learning from failure: it substitutes alternative goals in each trial, so that even if the agent does not reach the real goal, it still receives some reward and learns something from every experience (Andrychowicz et al., 2017). DDPG combined with HER performs well on the OpenAI Gym Robotics environments, so in our experiments we use this combination as a benchmark to evaluate the performance of our method.
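The goal substitution performed by HER can be sketched as below. The sketch assumes a goal-conditioned transition format, a sparse reward of 0 on success and -1 otherwise, and a distance tolerance of 0.05. These names and values are illustrative assumptions, and only the simplest relabeling strategy (using the final achieved state as the substitute goal) is shown.

```python
from typing import List, NamedTuple, Tuple

Goal = Tuple[float, ...]


class Transition(NamedTuple):
    state: Tuple[float, ...]
    action: Tuple[float, ...]
    achieved_goal: Goal  # where the agent actually ended up after the action
    desired_goal: Goal   # where it was asked to go
    reward: float


def sparse_reward(achieved: Goal, desired: Goal, tol: float = 0.05) -> float:
    """Binary reward: 0 if the goal is reached within `tol`, else -1 (assumed convention)."""
    dist = sum((a - d) ** 2 for a, d in zip(achieved, desired)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def relabel_with_final_goal(episode: List[Transition]) -> List[Transition]:
    """Pretend the final achieved state was the goal and recompute rewards."""
    new_goal = episode[-1].achieved_goal
    return [
        t._replace(desired_goal=new_goal,
                   reward=sparse_reward(t.achieved_goal, new_goal))
        for t in episode
    ]
```

Both the original and the relabeled transitions would be stored in the replay buffer, so a failed episode still contributes at least one transition with a successful outcome.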

3.2 Meta Learning

Although reinforcement learning has achieved super-human performance on single tasks, DRL still has deficiencies in two respects compared to humans (J. X. Wang et al., 2016). First, DRL is sample-inefficient, whereas humans can attain reasonable performance on a wide range of tasks with comparatively little experience. Second, DRL can only specialize in a narrow range of tasks, whereas humans can flexibly adapt to changing tasks. Much recent research focuses on the first challenge. Current reinforcement learning algorithms are slow for two reasons: weak inductive bias and incremental parameter adjustment (Botvinick et al., 2019). Two kinds of methods address this problem: Meta-RL and Episodic RL.

Meta-learning, known as learning to learn, originates in psychology (Harlow, 1949), and the idea can also be applied to RL. It means that when an agent faces a new environment or task, it can leverage prior knowledge (inductive bias) learned from previous tasks to master the new one faster. For example, people who know how to ride a bicycle learn to ride a motorcycle faster than those with no such experience. The idea can be implemented in several ways. One way is to train a recurrent neural network (RNN) on a series of tasks drawn from the same distribution to maximize the total reward over all the tasks. Because the network takes all previous information into account, it can learn knowledge common across tasks, which lets it solve a new task faster (Duan et al., 2016) (J. X. Wang et al., 2016). A simple but powerful algorithm named MAML (Model-Agnostic Meta-Learning) seeks initial parameters that, across a series of tasks, perform best after one update from them (Finn, Abbeel and Levine, 2017). Mishra et al. (2018) combined temporal convolution layers with causal attention layers, which avoided the exponentially increasing number of layers when using an RNN as the meta-learner. Some research uses Meta-RL to learn an exploration strategy. Gupta et al. (2018) add to the policy input a latent state drawn from a Gaussian distribution and use MAML to learn a good latent state, which in turn improves exploration. The experiments on 50 robotics environments in (Yu et al., 2019) establish a benchmark for current multi-task and meta-learning algorithms, which gives us a good reference for evaluating our model in the future.
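The MAML idea mentioned above can be written compactly for a single inner gradient step. The symbols used here (task distribution $p(\mathcal{T})$, per-task loss $\mathcal{L}_{\mathcal{T}_i}$, inner step size $\alpha$) follow common usage and are not defined elsewhere in this thesis. In the RL setting the loss would be the negative expected return.

```latex
\theta_i' = \theta - \alpha \, \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta),
\qquad
\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(\theta_i')
```

The outer minimization is over the shared initialization $\theta$, while each task is evaluated only after its own adapted parameters $\theta_i'$ have been computed.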

Episodic RL records the situations the agent has previously encountered, including the states and the actions taken, so that when facing a new situation the agent can search its memory for the most similar situation and take the associated action (Pritzel et al., 2017). In this way the agent can act immediately, without lengthy fine-tuning of the parameters. Ritter et al. (2018) combine episodic memory with a Meta-RL architecture; the proposed model, epL2RL, substantially outperforms the model without episodic memory in several experiments.
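The lookup described above can be sketched as a nearest-neighbor search over stored states. This follows the simplified description in the text (store a state with its action, recall the action of the closest stored state) and is not a reproduction of any published architecture. The class name and the Euclidean distance are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


class EpisodicMemory:
    """Stores (state, action) pairs and recalls the action of the nearest state."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Tuple[float, ...], int]] = []

    def write(self, state: Sequence[float], action: int) -> None:
        self._entries.append((tuple(state), action))

    def recall(self, state: Sequence[float]) -> Optional[int]:
        """Return the action stored with the closest state, or None if empty."""
        if not self._entries:
            return None

        def sq_dist(entry: Tuple[Tuple[float, ...], int]) -> float:
            return sum((a - b) ** 2 for a, b in zip(entry[0], state))

        return min(self._entries, key=sq_dist)[1]
```

A practical system would bound the memory size and typically store value estimates alongside the actions, which this sketch omits.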

Transfer learning aims to transfer prior knowledge from an old task to a new task to help the agent master the new task faster. The ideas behind Meta-RL and Episodic RL are very similar to transfer learning: they also use prior knowledge or memory to adapt to a new task efficiently. We can therefore use metrics from transfer learning to quantify the performance of Meta-RL algorithms, and we use some of them in our experiments to evaluate our model. Several metrics are available: jumpstart, asymptotic performance, total reward, transfer ratio, and time to threshold (Taylor and Stone, 2009).
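The metrics listed above can be computed from two learning curves, one for an agent trained with transfer and one for an agent trained from scratch. The sketch below assumes each curve is a list of per-iteration performance values such as success rates. The function names and the averaging window are illustrative choices.

```python
from typing import Optional, Sequence


def jumpstart(transfer: Sequence[float], scratch: Sequence[float]) -> float:
    """Improvement in initial performance due to transfer."""
    return transfer[0] - scratch[0]


def asymptotic_gain(transfer: Sequence[float], scratch: Sequence[float],
                    window: int = 2) -> float:
    """Improvement in final performance, averaged over the last `window` points."""
    def mean(xs: Sequence[float]) -> float:
        return sum(xs) / len(xs)
    return mean(transfer[-window:]) - mean(scratch[-window:])


def total_reward(curve: Sequence[float]) -> float:
    """Area under the learning curve (unit spacing between iterations)."""
    return sum(curve)


def transfer_ratio(transfer: Sequence[float], scratch: Sequence[float]) -> float:
    """Relative gain in total reward of the transfer learner over the baseline."""
    base = total_reward(scratch)
    return (total_reward(transfer) - base) / base


def time_to_threshold(curve: Sequence[float], threshold: float) -> Optional[int]:
    """First iteration index at which performance reaches the threshold."""
    for i, value in enumerate(curve):
        if value >= threshold:
            return i
    return None
```

Time to threshold is the metric behind statements such as "iterations needed to reach a 60% success rate": comparing its value for the two curves gives the reduction in training effort.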

3.3 Multi-task Learning

Multi-task learning (MTL) learns several related tasks together, leveraging useful information shared among them to improve the generalization performance on all the tasks. For example, when a person learns to ride a bicycle and a tricycle together, the experience gained in learning to ride the bicycle can be used in riding the tricycle, and vice versa (Zhang and Yang, 2017). The mechanism is very similar to transfer learning (Weiss, Khoshgoftaar and Wang, 2016), except for the objective: MTL aims to improve performance over all the tasks, while transfer learning focuses on improving the target tasks. Meta-learning aims to learn a common learning method by training on a series of related tasks, so that it can adapt to a new task with few samples and a short training process. Unlike meta-learning, MTL aims to learn the tasks themselves, so a trained MTL model can solve tasks without further adaptation. Because it leverages knowledge shared between tasks, MTL is a good way to address the sample-inefficiency problem in deep learning (Caruana, 1997), especially when labeled data is hard to collect, as with medical data. Recent research in MTL has been successful across many applications of machine learning, from natural language processing (Collobert and Weston, 2008) and speech recognition (Deng, Hinton and Kingsbury, 2013) to computer vision (Girshick, 2015) and drug discovery (Ramsundar et al., 2015). Since our focus is the RL problem setting, we mainly review the achievements of MTL in solving RL problems.

One of the promised benefits of MTL is that shared parameters can help improve performance across all the related tasks. In practice, however, knowledge shared between different tasks can interfere negatively, which leads to unstable and inefficient training. Teh et al. (2017) proposed an approach named “Distral” for joint training of multiple tasks. The main idea is that, instead of sharing parameters between tasks directly, all the tasks share a “distilled” policy that captures common behavior across tasks. The agent is constrained to solve each specific task without straying too far from the distilled policy, which makes training much more robust and efficient. A general issue in MTL is the imbalanced distribution of learning resources among the tasks. In other words, some tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards, which results in a loss of generality. Hessel et al. (2019) propose a method that automatically adapts the contribution of each task to the agent’s updates, so that all tasks have a similar impact on the learning dynamics. The method enables a single policy to master 57 diverse Atari games with super-human performance. Deisenroth et al. (2014) applied a policy-search method to multi-task learning for robots with stationary dynamics. The key idea is to parametrize the policy explicitly by the task, which enables it to generalize from training tasks to similar but unknown tasks at test time. This generalization is phrased as an optimization problem that can be solved jointly with learning the policy parameters. Meta-World (Yu et al., 2019) reports the performance of state-of-the-art MTL algorithms on 50 environments. The results show that in the scenario with 10 tasks the agent can reach an 80% success rate, while in the scenario with 50 tasks it reaches only 48%, which indicates that the performance of the MTL agent decreases significantly as the complexity of the high-level goal (the number of tasks in MTL) increases. The Meta-World results for MTL algorithms are an important reference benchmark for future research.

3.4 Modularity

Modularity has long been recognized as an effective adaptability mechanism and has been an active area of research in a wide range of academic disciplines since the influential work of Herbert Simon (Simon, 1991). Simon argues that near-decomposability enables systems to respond effectively to external changes without disrupting the system as a whole. Modularity has also been recognized as an essential concept in architecting engineering products, processes, and organizations. It has been shown to increase product and organizational variety (Eppinger and Ulrich, 2015), the rate of technological and social innovation (Baldwin and Clark, 2000), market dominance through interface capture (Moore, Louviere and Verma, 1999), and cooperation and trust in networked systems (Gianetto and Heydari, 2015) (Mosleh and Heydari, 2017).

The ubiquity of modularity as an adaptability mechanism in various complex systems makes it a suitable candidate for incorporation into RL algorithms, to achieve the adaptability that enables efficient transfer learning across tasks with both parametric and non-parametric variations. This has, in fact, been a goal of some AI researchers for decades. Leveraging modularity to scale up RL capabilities dates back to the early 1990s (Wixson, 1991) (Uchibe, Asada and Hosoda, 1996), following and enabled by the development of the Q-learning algorithm. However, modularity has been used with different meanings and approaches in AI since then, and it is important to distinguish between the different ways in which it has been used in meta-learning. With some simplification, previous work on modular meta-learning can be divided into two general categories, based on the overall approach to modularity. Here we briefly discuss these two approaches.

The first approach, dominant in the pre-deep-RL era, is based on hierarchical learning (Barto and Mahadevan, 2003), in which most frameworks consist of two separate mechanisms: task decomposition (Singh, 1992) followed by behavior coordination (Uchibe, Asada and Hosoda, 1996). In this approach, different modules act as different action voters. That is, each module observes the action taken by the agent, the state transition, and a reward signal specific to the module. At each time step, the agent combines the action preferences of the modules to compute a joint policy (Russell and Zimdars, 2003) (Sprague and Ballard, 2003) (Simpkins and Isbell, 2019). More recently, Frans et al. (2018) applied hierarchical RL with deep neural networks by pre-training a pool of common neural network sub-policies across tasks and then using a task-specific master-policy neural network to select the appropriate sub-policy.
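The "action voter" scheme of the first approach can be illustrated with a small sketch in which each module reports its own action values for the current state and the agent picks the action with the highest summed value. Summation is one simple way to combine preferences and is used here as an assumption. The function name and the list-of-lists layout are hypothetical.

```python
from typing import List


def joint_greedy_action(module_q_values: List[List[float]]) -> int:
    """Pick the action whose summed value across all modules is highest.

    `module_q_values[m][a]` is module m's value estimate for action a
    in the current state.
    """
    n_actions = len(module_q_values[0])
    totals = [sum(q[a] for q in module_q_values) for a in range(n_actions)]
    return max(range(n_actions), key=totals.__getitem__)
```

In this example layout, an action that is the favorite of no single module can still be chosen if it is acceptable to most of them, which is the behavior-coordination effect the text describes.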

Unlike the first approach which considers modules as distinct (goal-specific) policy

agents whose action recommendations need to be aggregated for each task, the second

approach directly applies modularity to the deep neural network architecture. Modules are

reusable neural network functions that are pre-trained and can then be recombined in

different ways to undertake new tasks. The basic idea in this approach is that rather than

training a single network on a large number of training data, one can simultaneously train

a large number of different networks, while tying their parameters together, which in the

end generates a set of reusable neural network modules. This idea has recently been applied

to a variety of applications such as real-world reasoning problems (Andreas et al., 2016),

task and motion planning (Chitnis, Kaelbling and Lozano-Perez, 2019), and robotics

(Devin et al., 2017). In the latter, the authors show that neural network policies can be

decomposed into task-specific and robot-specific modules, where the former are shared

across robots and the latter are shared across tasks. Our work builds on Alet, Lozano-Pérez

and Kaelbling (2018), in which the authors use a set of neural network modules for a set of

basic functions that can be re-tuned and recombined using adaptive structures in the face

of new tasks. Alet, Lozano-Pérez and Kaelbling (2018), however, formulate and apply this

method to a set of supervised robotic problems and do not extend it to RL.

4. Framework

As discussed in the background section, deep RL has achieved recent remarkable

success in reaching human-level performance in tasks with discrete and low-dimensional

action-spaces such as playing Atari via the DQN algorithm (Mnih et al., 2015). Several

algorithms have been introduced in recent years for dealing with continuous, high-

dimensional action-spaces among which the DDPG algorithm (Lillicrap et al., 2016) has

demonstrated great success in accomplishing relatively simple tasks such as pendulum,

cartpole swing up, or puck shooting. For more complex, continuous and high-dimensional

action-space tasks such as robotic assembly or piece-picking, however, current off-policy

RL algorithms such as DDPG may not be directly applicable (see, e.g., Andrychowicz et

al., 2017; Vecerik et al., 2019) due to the inherent complexity of these tasks. We tackle this problem

through task modularization and module composition in order to enable the transfer of

policies across different tasks with non-parametric variations (Vinyals et al., 2019).

Without loss of generality, we implement this notion through a modular DDPG (M-DDPG)

algorithm and test it on three OpenAI Gym Robotics environments. Details of the proposed

framework are presented next.

4.1 Modularization and Module Composition

Deciding on the degree and architecture of modularity is one of the key decisions to

make, governed by a range of parameters that characterize the degree of spatial variations

(e.g., distance between different tasks or environments) as well as temporal uncertainties

(e.g., unexpected obstacles on the way, or dynamic environments) (Heydari, Mosleh and

Dalili, 2016). In this context, the objective of modularization is to decompose tasks into reusable modules for combinatorial generalization to a wide range of new tasks. We present a formalism of the modularization process here and leave the development of a methodology for automated identification of task modules for future research. The underlying assumption of task modularization is that there exists a composition function (Alet, Lozano-Pérez and Kaelbling, 2018) for forming any new task from a set of modules. Once the composition function has been learned, a new task can be decomposed into several modules according to the learned relationships among them. Assuming that each module can be trained with the proposed M-DDPG algorithm, the pre-trained modules can then be composed into a single function that solves the new task. Thus, the key idea here is to learn the underlying composition function.
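To make the notion of composing pre-trained modules concrete, the following is a minimal sketch of a hand-written chain composition. It does not learn the composition function; the one-dimensional observation layout, the toy modules, and the helper `compose_chain` are all hypothetical and serve only to show how two module policies can be combined into one policy for a new task.

```python
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]   # toy layout: (gripper_x, object_x, goal_x)
Action = Tuple[float]
Policy = Callable[[Observation], Action]
Done = Callable[[Observation], bool]


def step_towards(src: float, dst: float) -> Action:
    """Unit step along the line from src towards dst."""
    return (1.0 if dst > src else -1.0,)


def reach_module(obs: Observation) -> Action:
    return step_towards(obs[0], obs[1])   # move the gripper to the object


def push_module(obs: Observation) -> Action:
    return step_towards(obs[1], obs[2])   # move the object to the goal


def reached(obs: Observation) -> bool:
    return abs(obs[0] - obs[1]) < 0.05


def compose_chain(stages: List[Tuple[Policy, Done]]) -> Policy:
    """Run each module until its `done` test holds, then hand over to the next."""
    def composed(obs: Observation) -> Action:
        for policy, done in stages[:-1]:
            if not done(obs):
                return policy(obs)
        return stages[-1][0](obs)   # the last module runs until the episode ends
    return composed


composed = compose_chain([(reach_module, reached), (push_module, lambda obs: False)])
```

Because the hand-over test is re-evaluated at every step, control returns to the reach module whenever contact is lost; a learned composition function would not need to follow this fixed rule.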

Graph neural networks (GNNs) naturally support combinatorial generalization because they do not perform computations strictly at the system level, but also apply shared computations across the entities and across the relations (Battaglia et al., 2018). The idea behind GNNs is the “infinite use of finite means” (Peter and Chomsky, 1968): re-using the same components in different compositional structures can produce a wide variety of effects. We therefore use a GNN to learn the composition function, following the structure setting in Alet et al. (2019). A graph is defined as a 3-tuple $G = (u, V, E)$, where $u$ is a global attribute, $V = \{v_i\}_{i=1:N_v}$ is the set of nodes (of cardinality $N_v$), and $E = \{(e_k, r_k, s_k)\}_{k=1:N_e}$ is the set of edges (of cardinality $N_e$), in which each $e_k$ is the edge’s attribute, $r_k$ is the index of the receiver node, and $s_k$ is the index of the sender node (Alet et al., 2019). In our case, each node represents a module decomposed from the task, and each edge between two nodes represents the relationship between the two corresponding modules. The graph is directed, so swapping the sender and the receiver of an edge yields a different graph.
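The graph definition can be written down as a small data structure. This is a sketch only: the class names, the string-valued attributes, and the `successors` helper are assumptions for illustration, and the two-node example mirrors a task with a reaching module followed by a manipulation module.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class Edge:
    attr: Any       # e_k: the edge's attribute
    receiver: int   # r_k: index of the receiver node
    sender: int     # s_k: index of the sender node


@dataclass
class Graph:
    u: Any                                  # global attribute
    nodes: List[Any]                        # V: one entry per task module
    edges: List[Edge] = field(default_factory=list)   # E

    def successors(self, i: int) -> List[int]:
        """Indices of the nodes that receive an edge sent by node i."""
        return [e.receiver for e in self.edges if e.sender == i]


# Two modules and one directed edge from node 0 (sender) to node 1 (receiver).
g = Graph(u="push_task",
          nodes=["reach", "push_after_contact"],
          edges=[Edge(attr="then", receiver=1, sender=0)])
```

Querying `g.successors(0)` and `g.successors(1)` gives different answers, which is the sense in which swapping sender and receiver changes the graph.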

4.2 Modular DDPG Reinforcement Learning

The GNN helps us carry out the modularization and composition steps efficiently. It also allows us to re-use pre-trained modules to initialize new ones, which gives the agent a jumpstart when facing a new task. Thus, when using the proposed M-DDPG to train a new module, we first check whether a similar module exists in the module repository; if so, we transfer its prior knowledge to the new module, and if not, we initialize the new module randomly and train it from scratch. To handle the training process, we propose a novel algorithm named Modular DDPG (M-DDPG).
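The repository check can be sketched as follows. The flat parameter lists, the hand-specified `similar_to` mapping, and the initialization range are hypothetical stand-ins; in particular, how similarity between modules is determined is not specified here and is simply assumed to be given.

```python
import random
from typing import Dict, List, Optional

Params = List[float]


def init_module(name: str,
                repo: Dict[str, Params],
                similar_to: Dict[str, str],
                n_params: int,
                rng: random.Random) -> Params:
    """Copy parameters from a similar pre-trained module, else start randomly."""
    source: Optional[str] = similar_to.get(name)
    if source is not None and source in repo:
        return list(repo[source])   # transfer: copy of the source parameters
    # No similar module in the repository: random initialization (assumed range).
    return [rng.uniform(-0.1, 0.1) for _ in range(n_params)]


repo = {"reach": [0.3, -0.2, 0.7]}                 # hypothetical trained module
similar_to = {"pnp_before_contact": "reach"}       # assumed similarity result
rng = random.Random(0)

transferred = init_module("pnp_before_contact", repo, similar_to, 3, rng)
scratch = init_module("pnp_after_contact", repo, similar_to, 3, rng)
```

Copying rather than aliasing the source parameters keeps the repository entry unchanged while the new module continues training.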

A standard DDPG setup with fully observed environments (Hou et al., 2017) is considered for each task module. Let $f$ and $\theta$ denote the set of neural networks and the respective parameters representing a task module. $f$ comprises four neural networks: the actor's policy network, the critic's Q network, the actor's target policy network, and the critic's target Q network, with network parameters denoted by $\theta = (\theta^{\mu}; \theta^{Q}; \theta^{\mu'}; \theta^{Q'})$, in that order. $Q(s, a)$ is the value of a state-action pair, and $\mu(s) = a$ is the deterministic policy. For each given module of a new task, the actor and critic networks are initialized with the parameters of the most related source module, denoted by $\theta^{\mu}_{*}$ and $\theta^{Q}_{*}$. Note that both sets of parameters are initialized randomly if no such pre-learned module exists. The critic's loss function is thus calculated as $L = \frac{1}{N}\sum_{i} \left[y_i - Q(s_i, a_i \mid \theta^{Q})\right]^2$, where $\theta^{Q}_{\text{init}} \leftarrow \theta^{Q}_{*}$, $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$, $N$ is the size of the mini-batch of experiences sampled from the replay buffer, and $\gamma$ is a discount factor. The actor's policy gradient over the same mini-batch of size $N$ is calculated as $\nabla_{\theta^{\mu}} J(\theta) = \frac{1}{N}\sum_{i} \left[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}\right]$, where $\theta^{\mu}_{\text{init}} \leftarrow \theta^{\mu}_{*}$.
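The two update quantities can be checked on a toy example. The sketch below uses scalar states and actions and closed-form stand-ins for the networks (a constant critic, and a policy $\mu(s) = \theta s$ with critic $Q(s, a) = -(a - s)^2$); these functions and all numbers are assumptions chosen so the arithmetic can be followed by hand, and they are not the networks used in this thesis.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float, float]   # (s_i, a_i, r_i, s_{i+1})


def critic_loss(batch: List[Transition],
                q: Callable[[float, float], float],
                q_target: Callable[[float, float], float],
                mu_target: Callable[[float], float],
                gamma: float) -> float:
    """L = 1/N * sum_i (y_i - Q(s_i, a_i))^2, y_i = r_i + gamma * Q'(s', mu'(s'))."""
    total = 0.0
    for s, a, r, s_next in batch:
        y = r + gamma * q_target(s_next, mu_target(s_next))
        total += (y - q(s, a)) ** 2
    return total / len(batch)


def actor_gradient(states: List[float], theta: float) -> float:
    """Mean of grad_a Q * grad_theta mu for mu(s) = theta*s, Q(s, a) = -(a - s)^2."""
    grads = [(-2.0 * (theta * s - s)) * s for s in states]   # dQ/da * dmu/dtheta
    return sum(grads) / len(grads)


batch = [(0.0, 0.0, -1.0, 1.0), (1.0, 0.0, 0.0, 2.0)]
loss = critic_loss(batch,
                   q=lambda s, a: 0.0,          # toy online critic
                   q_target=lambda s, a: -1.0,  # toy target critic
                   mu_target=lambda s: 0.0,     # toy target actor
                   gamma=0.98)
grad = actor_gradient([1.0, 2.0], theta=0.5)
```

Here the targets are $y_1 = -1.98$ and $y_2 = -0.98$, and the positive actor gradient pushes $\theta$ towards 1, where the toy critic is maximized.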

The notion of soft updates (Mnih et al., 2015) (Lillicrap et al., 2016) is applied for updating the actor and critic target networks as $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1 - \tau)\theta^{\mu'}$ and $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1 - \tau)\theta^{Q'}$, respectively ($\tau \ll 1$). This is accompanied by on-the-fly exploration based on the Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930). The resulting state ($s_i$) and reward ($r_i$), together with the action ($a_i$), are then stored in a finite-size replay buffer as experience tuples $(s_i; a_i; r_i; s_{i+1})$, which are sampled for updating the neural network parameters. In the sparse-reward setting, the reward typically takes a value of $-1$ or $0$, indicating missing and achieving a goal $g$, respectively. The pseudocode of M-DDPG is presented in Algorithm 1 (see Figure 3).
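The soft target update and the exploration noise can be sketched in a few lines. The flat parameter lists, the class `OUNoise`, and its default values ($\theta = 0.15$, $\sigma = 0.2$, unit time step) are assumptions for illustration; the discretization shown is a simple Euler step and is not claimed to be the one used in our implementation.

```python
import math
import random
from typing import List


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


class OUNoise:
    """Euler-discretized Ornstein-Uhlenbeck process for temporally correlated noise."""

    def __init__(self, mu: float = 0.0, theta: float = 0.15, sigma: float = 0.2,
                 dt: float = 1.0, seed: int = 0) -> None:
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = 0.0

    def sample(self) -> float:
        drift = self.theta * (self.mu - self.x) * self.dt          # pull towards mu
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


new_target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.05)
noise = OUNoise()
noisy_action = 0.3 + noise.sample()   # exploration: policy output plus OU noise
```

With $\tau = 0.05$ the target parameter moves only 5% of the way towards the online parameter per update, which is what keeps the critic targets slowly varying.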

When facing high-level, complex tasks such as piece-picking, it is very difficult for the agent to complete the task in the early stage of training, when the policy is still very poor. In other words, the reward is so sparse that the agent learns nothing: whatever actions it takes, it receives only negative rewards, which prevents it from figuring out which action is better. Hindsight experience replay (HER) is a strategy for speeding up training when rewards are sparse (Andrychowicz et al., 2017). It creates fake goals (typically the end state after each action) for the agent at every time step. The episodes are then re-examined with a number of fake goals, resulting in a replay buffer associated with both the original and the fake goals. The replay buffer is used for sampling mini-batches to update the actor and critic networks. Thus, the agent can always learn something, even from its mistakes. Intuitively, the fake goals act as intermediate goals that guide the agent towards the real goal gradually, relying on the strong generalization ability of neural networks. In this thesis, we use HER to enhance M-DDPG, which speeds up training and results in better performance.
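Goal relabeling can be illustrated with a minimal sketch. It uses one-dimensional positions, a hypothetical tolerance of 0.05, and the simplest relabeling rule, in which the final achieved state of the episode is substituted as the goal; Andrychowicz et al. (2017) describe several relabeling strategies, and this sketch is not a statement of which one our implementation uses.

```python
from typing import List, Tuple

Step = Tuple[float, float, float, float]                  # (s, a, s_next, goal)
Experience = Tuple[float, float, float, float, float]     # (s, a, r, s_next, goal)


def sparse_reward(achieved: float, goal: float, tol: float = 0.05) -> float:
    """0 when the goal is achieved within the tolerance, -1 otherwise."""
    return 0.0 if abs(achieved - goal) <= tol else -1.0


def her_final(episode: List[Step]) -> List[Experience]:
    """Store each transition with its original goal and with a hindsight goal."""
    final_state = episode[-1][2]   # what the agent actually achieved
    buffer: List[Experience] = []
    for s, a, s_next, g in episode:
        buffer.append((s, a, sparse_reward(s_next, g), s_next, g))
        buffer.append((s, a, sparse_reward(s_next, final_state), s_next, final_state))
    return buffer


# A failed two-step episode: the goal is at 1.0 but the agent only reaches 0.2.
episode = [(0.0, 0.1, 0.1, 1.0), (0.1, 0.1, 0.2, 1.0)]
replay = her_final(episode)
rewards = [exp[2] for exp in replay]
```

Although every original transition is a failure, the relabeled copy of the last transition carries a reward of 0, so the buffer contains at least one successful example to learn from.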

Figure 3: Algorithm 1-- Modular-DDPG

5. Experiment and Results

This section evaluates the performance of the proposed framework on a set of

Robotics environments on the OpenAI Gym (Brockman et al., 2016). The OpenAI Gym

provides various virtual standard benchmark environments for RL research and has

contributed significantly to recent advances in AI research, as evidenced by the high

percentage of publications that use such environments for validation and benchmarking.

The existing Robotics environments allow experiments on a set of virtual goal-oriented

tasks including reaching an object positioned randomly in 3D space, pushing/sliding an

object to a goal position in 2D space, and picking an object and placing it on a random

position in 3D space. The Gym environments are all enabled via the MuJoCo physics

simulator (Todorov, Erez and Tassa, 2012).

5.1 Experiment Design

Each module neural network is trained using DDPG (Lillicrap et al., 2016) with HER (Andrychowicz et al., 2017) for reward shaping (see Algorithm 1). All four deep neural networks (i.e., the training actor network, target actor network, training critic network, and target critic network) comprise three fully-connected hidden layers with 256 neurons each. Each training epoch comprises 50 cycles, and each cycle comprises two rollouts. In each rollout, the agent interacts with the environment for 50 steps, after which the network parameters are updated 40 times. Other hyper-parameter settings are similar to those of Andrychowicz et al. (2017). For training with M-DDPG, we first pre-train the networks on a set of tasks and then use the pre-trained networks to initialize the similar modules of a new task, as explained in Section 4.2. We then allow the agent to interact with the new environment and update the network parameters. The experiments were executed on a computer with an AMD Ryzen Threadripper 2970WX 24-core CPU at 3.00 GHz, on which every epoch took an average of 70 seconds to run. The experiments are designed to compare the performance of M-DDPG against DDPG as a baseline with respect to jumpstart and asymptotic performance, as suggested by Taylor and Stone (2009).
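The training budget implied by these settings can be worked out directly. The constants below are taken from the description above, read as one block of 40 updates per rollout (an assumption about how the update schedule is counted); the epoch count passed to `wall_clock_hours` is a hypothetical value, not a figure reported in this thesis.

```python
CYCLES_PER_EPOCH = 50
ROLLOUTS_PER_CYCLE = 2
STEPS_PER_ROLLOUT = 50
UPDATES_PER_ROLLOUT = 40     # assumption: 40 updates follow every rollout
SECONDS_PER_EPOCH = 70       # average reported for the test machine

rollouts_per_epoch = CYCLES_PER_EPOCH * ROLLOUTS_PER_CYCLE
env_steps_per_epoch = rollouts_per_epoch * STEPS_PER_ROLLOUT
updates_per_epoch = rollouts_per_epoch * UPDATES_PER_ROLLOUT


def wall_clock_hours(epochs: int) -> float:
    """Approximate training time for a given number of epochs."""
    return epochs * SECONDS_PER_EPOCH / 3600.0


hours_for_100_epochs = wall_clock_hours(100)   # 100 epochs is a hypothetical budget
```

Under this reading, one epoch corresponds to 5,000 environment steps and 4,000 parameter updates, and a hypothetical 100-epoch run would take a little under two hours.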

5.2 Experimental Results

The proposed framework, specifically M-DDPG, was implemented and tested on

three Robotics environments of the OpenAI Gym. Task modularization was conducted by

decomposing the initial tasks into sub-tasks associated with before and after reaching the

object (see Figure 4). For example, consider Reach and Push as the source tasks already

learned, and Pick-and-Place as the target task to be learned. Now let us define three source

task modules: (A) Reach, (B) Push before reaching the object, and (C) Push after reaching

the object. Let us also define two target task modules: (D) Pick-and-Place before reaching

the object, (E) Pick-and-Place after picking the object up. A potential mapping for this

scenario would therefore be $D \rightarrow B$ or $D \rightarrow A$. The experiments have been conducted on

two scenarios:

• Scenario 1 (Figure 5). Transfer the reaching module of the Pick-and-Place task from

(i) the Reach task, or (ii) the reaching module of the Push task. Baseline: DDPG

(no transfer).

• Scenario 2 (Figure 6). Transfer the reaching module of the Push task from (i) the

Reach task, or (ii) the reaching module of the Pick-and-Place task. Baseline: DDPG

(no transfer).
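The module mapping and the resulting experimental conditions can be written as a small lookup table. The dictionary layout and the helper `conditions` are hypothetical; the module labels A, B, C, and D follow the definitions given in the text, and only the candidate sources stated for module D are listed.

```python
from typing import Dict, List

# Source modules assumed to be available in the module repository.
SOURCE_MODULES: Dict[str, str] = {
    "A": "Reach",
    "B": "Push, before reaching the object",
    "C": "Push, after reaching the object",
}

# Candidate sources per target module; D is "Pick-and-Place, before reaching".
CANDIDATES: Dict[str, List[str]] = {"D": ["A", "B"]}


def conditions(target: str) -> List[str]:
    """Each transfer candidate for a target module, plus the no-transfer baseline."""
    transfers = [f"{target} <- {SOURCE_MODULES[s]}" for s in CANDIDATES.get(target, [])]
    return transfers + [f"{target} <- random init (DDPG baseline)"]


scenario_1 = conditions("D")
```

A target module without a listed candidate, such as E, falls back to the baseline condition only.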

Figure 4: Experimental setup and task modularization on the OpenAI Gym environments: (a) Reach a randomly chosen position in 3D space. (b) Push an object to a randomly chosen position in 2D space (tabletop). (c) Pick-and-Place to a randomly chosen position in 3D space.

Figure 5: Preliminary results on Scenario 1

Figure 6: Preliminary results on Scenario 2

The results also indicate the importance of selecting ‘the right’ source task: in Reach, the agent is rewarded for reaching the object regardless of the ‘side’ reached, while in Push and Pick-and-Place, the agent may receive no reward even when it reaches the object if it is on the wrong side. In the graphs, the X-axis is the number of training epochs and the Y-axis is the success rate during testing, where 100 test episodes are executed after each epoch. To reduce the influence of noise and randomness, the agent in each environment setup is trained five times. The solid lines represent the mean performance for each environment setting, and the shaded regions represent its variance.
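The plotted quantities can be computed as sketched below. The success-rate values are made-up placeholders and not experimental results, and the use of the population variance is an assumption, since the text does not state which variance estimator was used.

```python
from statistics import mean, pvariance
from typing import List, Tuple


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of successful test episodes after one epoch."""
    return sum(outcomes) / len(outcomes)


def aggregate(runs: List[List[float]]) -> List[Tuple[float, float]]:
    """Per-epoch (mean, variance) of the success rate across independent runs."""
    return [(mean(epoch), pvariance(epoch)) for epoch in zip(*runs)]


# Hypothetical success rates: 5 independent runs x 3 epochs.
runs = [
    [0.0, 0.4, 0.8],
    [0.1, 0.5, 0.9],
    [0.0, 0.6, 1.0],
    [0.2, 0.5, 0.9],
    [0.2, 0.5, 0.9],
]
curve = aggregate(runs)                                   # one (mean, variance) per epoch
rate = success_rate([True] * 60 + [False] * 40)           # 100 test episodes
```

The first element of each pair gives the solid line and the second gives the width of the shaded region at that epoch.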

The results show that the proposed framework significantly outperforms the baseline DDPG in boosting both the jumpstart and asymptotic performance of the robot (Taylor and Stone, 2009). Specifically, the number of iterations needed to achieve a 60% success rate is reduced by over 80% in Scenario 1 and over 70% in Scenario 2. Further, the agent converges much faster under M-DDPG than under the baseline DDPG in both scenarios. Moreover, the agent is able to fully learn the Pick-and-Place (PNP) task (with transfer from either Reach or Push), while the baseline DDPG is unable to advance beyond a 70% success rate on average. One interesting observation is that, in Scenario 2, transfer from Reach yields slightly lower performance than transfer from PNP (see Figure 6). We speculate that the underlying reason is that in Reach, the agent is rewarded for reaching the object regardless of which side of the object it has reached, while in Push the agent may receive no reward even when it reaches the object if it is on the wrong side. This behavior is much less evident in Scenario 1, which may be because the Pick-and-Place agent must reach the center of the object rather than a specific side of it. Another observation is that the success rate of the agent has relatively lower variance under M-DDPG. We speculate that this behavior is due to the introduction of an inductive bias (Botvinick et al., 2019) through transferring from similar, previously learned task modules.
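The comparison metrics can be made explicit with a short sketch. The two learning curves are invented placeholders chosen only to exercise the functions; they are not the curves obtained in our experiments. Jumpstart is taken here as the difference in initial performance, following the usage in Taylor and Stone (2009).

```python
from typing import List, Optional


def epochs_to_threshold(curve: List[float], threshold: float = 0.6) -> Optional[int]:
    """First epoch (1-based) at which the success rate reaches the threshold."""
    for epoch, rate in enumerate(curve, start=1):
        if rate >= threshold:
            return epoch
    return None   # the threshold is never reached


def reduction(baseline_epochs: int, transfer_epochs: int) -> float:
    """Relative reduction in epochs needed, compared with the baseline."""
    return 1.0 - transfer_epochs / baseline_epochs


def jumpstart(transfer_curve: List[float], baseline_curve: List[float]) -> float:
    """Difference in performance at the start of training."""
    return transfer_curve[0] - baseline_curve[0]


# Hypothetical success-rate curves, one value per epoch.
baseline = [0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.58, 0.62]
transfer = [0.3, 0.65, 0.8, 0.9, 0.95, 1.0, 1.0, 1.0, 1.0, 1.0]

base_epochs = epochs_to_threshold(baseline)
m_ddpg_epochs = epochs_to_threshold(transfer)
saved = reduction(base_epochs, m_ddpg_epochs)
gain = jumpstart(transfer, baseline)
```

With these placeholder curves, the threshold is reached at epoch 10 without transfer and at epoch 2 with transfer, a reduction of 80%.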

5.3 Discussion

The results shown in the graphs demonstrate that our proposed framework can speed up training and reach a much better performance than the baseline DDPG algorithm. However, the experiments have some limitations that need further research.

First, the tasks and the environments used in this thesis are simple and straightforward. Both the Push task and the Pick-and-Place task can be decomposed into only two modules, and the composition process likewise involves only two modules. The boundary between these two modules is very clear: it is the moment when the robot gripper touches the object. The relationship between the two modules is also easy to discover: the graph constructed from them contains two nodes and a single directed edge, pointing from the reach module to the Push/PNP module. However, in real-life problems, the number of modules decomposed from a task can be much larger, and the potential relationships among modules will be diverse, including single module, compositional structure, weighted ensemble, and general function-composition tree (Alet, Lozano-Pérez and Kaelbling, 2018). Thus, the proposed framework needs further evaluation on more difficult tasks, which contain more modules with complex inner structure, to demonstrate the efficiency of the modularization and composition steps.

Second, in this thesis, we assume that the result of similarity checking between new modules and pre-trained modules in the module repository is binary and unique: two modules are either the same or totally different. So when the target pre-trained module is found, we initialize the new module with its exact parameters, in a hard-copy style. However, the difference observed between transferring from Reach and transferring from PNP in Scenario 2 suggests that the module repository may contain several similar pre-trained modules with different degrees of similarity. This poses several new questions about transfer between task modules: Which metrics can we use to calculate the degree of similarity? When there are several similar pre-trained module candidates, can we transfer the policy from multiple candidates weighted by their degree of similarity (soft-copy style), or should we choose the most similar one? How would these two methods affect the performance of the agent?
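The two transfer styles raised in these questions can be contrasted in a short sketch. Everything here is hypothetical: the similarity scores are invented, the parameters are flat lists rather than network weights, and the weighted average is only one conceivable form of soft copy. Averaging the weights of independently trained networks is not guaranteed to produce a useful policy, which is part of what remains to be investigated.

```python
from typing import Dict, List

Params = List[float]


def hard_copy(candidates: Dict[str, Params], similarity: Dict[str, float]) -> Params:
    """Copy the parameters of the single most similar candidate."""
    best = max(similarity, key=similarity.get)
    return list(candidates[best])


def soft_copy(candidates: Dict[str, Params], similarity: Dict[str, float]) -> Params:
    """Average the candidates' parameters, weighted by normalized similarity."""
    total = sum(similarity.values())
    n = len(next(iter(candidates.values())))
    return [sum(similarity[k] * candidates[k][j] for k in candidates) / total
            for j in range(n)]


# Hypothetical candidates for initializing the reaching module of the Push task.
candidates = {"reach": [1.0, 0.0], "pnp_reach": [0.0, 2.0]}
similarity = {"reach": 0.25, "pnp_reach": 0.75}

hard = hard_copy(candidates, similarity)
soft = soft_copy(candidates, similarity)
```

In this toy case the hard copy returns the parameters of the PNP reaching module unchanged, while the soft copy lies between the two candidates, closer to the more similar one.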

Further research on these limitations is the key to increasing the generality of the proposed framework, so that it can be applied to complex real-life problems.

6. Conclusion

This thesis proposes a novel task modularization framework built upon recent advances in transfer and meta-learning to enhance the adaptability of complex and unstructured robotic applications in manufacturing and logistics. Our research is motivated, on one hand, by the 11.47% compound annual growth in the global market for industrial robotics (MHI-Deloitte, 2019), and on the other hand, by the absence of rigorous methodologies for efficient transfer of tasks in robotics for higher autonomy and adaptability, which current start-from-scratch RL methods do not provide. We incorporate the rich science of complex adaptive systems with recent advances in deep RL to address a fundamental research question: How can a robot autonomously adapt to complex and unstructured manufacturing and logistics environments (e.g., assembly or piece-picking) through automated transfer of its learned knowledge across parametrically- and non-parametrically-different tasks? We tackled this question through a three-stage framework for task modularization based on the principles of complex systems theory and module neural network training based on DDPG. Our implementation of the proposed M-DDPG on the Robotics environments of the OpenAI Gym indicated significant improvement in both jumpstart and asymptotic performance of the agent compared to the baseline DDPG.

Future research on this project will address the limitations discussed in Section 5.3. First, the proposed framework should be evaluated on more difficult tasks that contain more modules and more inner structure between them. Second, appropriate metrics should be chosen to determine the degree of similarity between modules, and different knowledge-transfer methods should be tested to find the one that gives the agent the best performance. Third, since the graph neural network (GNN) (Battaglia et al., 2018) is an excellent tool for representing relationships, generating a GNN is a potential way to compose all modules together.

The long-term vision of the authors is to both increase the range of robotic

operations (RightHand Robotics, 2018) and minimize reprogramming through task

modularization and sharing of the learned knowledge across various tasks and fleets of

robots. By integrating the science of complex adaptive systems with artificial intelligence

techniques and scaling the notion of modularization and transfer up to a large network of

robots performing a wide variety of tasks, the proposed research will result in a

continuously evolving, shared knowledgebase that will eventually enable near-real-time

learning of new robotic tasks in manufacturing, logistics, and other industries. This, in turn,

will lead to less-labor-intensive, 24/7 operations with significantly shorter lead times,

serving the ultimate vision of Industry 4.0 for lot-size of one production. Our vision can be

further expanded to a longer-term research endeavor to: 1) Enhance adaptive learning in

networks of heterogeneous robots; 2) Improve future of work systems with human-in-the-

loop by enabling transfer learning between humans and piece-picking robots; 3) Extend the

developed models and frameworks to other robotic applications beyond manufacturing and

logistics.

REFERENCES

Alet, F., Lozano-Pérez, T. and Kaelbling, L. P. (2018) ‘Modular meta-learning’.

Andreas, J. et al. (2016) ‘Neural module networks’, in Proceedings of the IEEE

Computer Society Conference on Computer Vision and Pattern Recognition. doi:

10.1109/CVPR.2016.12.

Andrychowicz, M. et al. (2017) ‘Hindsight experience replay’, Advances in Neural

Information Processing Systems, pp. 5049–5059.

Baldwin, C. Y. and Clark, K. B. (2000) Design rules. Volume 1, The power of

modularity. MIT Press.

Barto, A. G. and Mahadevan, S. (2003) ‘Recent Advances in Hierarchical Reinforcement

Learning’, Discrete Event Dynamic Systems: Theory and Applications. doi:

10.1023/A:1022140919877.

Battaglia, P. W. et al. (2018) ‘Relational inductive biases, deep learning, and graph

networks’, pp. 1–40.

Botvinick, M. et al. (2019) ‘Reinforcement Learning, Fast and Slow’, Trends in

Cognitive Sciences. Elsevier Ltd, 23(5), pp. 408–422. doi: 10.1016/j.tics.2019.02.006.

Brockman, G. et al. (2016) ‘OpenAI Gym’, arXiv:1606.01540v1, pp. 1–4.

Caruana, R. (1997) ‘Multitask Learning’, Machine Learning. doi:

10.1023/A:1007379606734.

Chitnis, R., Kaelbling, L. P. and Lozano-Perez, T. (2019) ‘Learning quickly to plan

quickly using modular meta-learning’, in Proceedings - IEEE International Conference

on Robotics and Automation. doi: 10.1109/ICRA.2019.8794342.

Collobert, R. and Weston, J. (2008) ‘A unified architecture for natural language

processing’, in. doi: 10.1145/1390156.1390177.

Degris, T., White, M. and Sutton, R. S. (2012) ‘Off-policy actor-critic’, in Proceedings of

the 29th International Conference on Machine Learning, ICML 2012.

Deisenroth, M. P. et al. (2014) ‘Multi-task policy search for robotics’, in Proceedings -

IEEE International Conference on Robotics and Automation. doi:

10.1109/ICRA.2014.6907421.

Deng, L., Hinton, G. and Kingsbury, B. (2013) ‘New types of deep neural network

learning for speech recognition and related applications: An overview’, in ICASSP, IEEE

International Conference on Acoustics, Speech and Signal Processing - Proceedings. doi:

10.1109/ICASSP.2013.6639344.

Devin, C. et al. (2017) ‘Learning modular neural network policies for multi-task and

multi-robot transfer’, Proceedings - IEEE International Conference on Robotics and

Automation, pp. 2169–2176. doi: 10.1109/ICRA.2017.7989250.

Duan, Y. et al. (2016) ‘RL$^2$: Fast Reinforcement Learning via Slow Reinforcement

Learning’, pp. 1–14.

Finn, C., Abbeel, P. and Levine, S. (2017) ‘Model-agnostic meta-learning for fast

adaptation of deep networks’, 34th International Conference on Machine Learning,

ICML 2017, 3, pp. 1856–1868.

Frans, K. et al. (2018) ‘Meta learning shared hierarchies’, in 6th International

Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings.

García, J. and Shafie, D. (2020) ‘Teaching a humanoid robot to walk faster through Safe

Reinforcement Learning’, Engineering Applications of Artificial Intelligence. Elsevier

Ltd, 88(November 2019), p. 103360. doi: 10.1016/j.engappai.2019.103360.

Gianetto, D. A. and Heydari, B. (2015) ‘Network Modularity is essential for evolution of

cooperation under uncertainty’, Scientific Reports. Nature Publishing Group, 5(1), p.

9340. doi: 10.1038/srep09340.

Girshick, R. (2015) ‘Fast R-CNN’, in Proceedings of the IEEE International Conference

on Computer Vision. doi: 10.1109/ICCV.2015.169.

Gupta, A. et al. (2018) ‘Meta-reinforcement learning of structured exploration strategies’,

Advances in Neural Information Processing Systems, pp. 5302–5311.

Harlow, H. F. (1949) ‘The formation of learning sets’, Psychological Review. doi:

10.1037/h0062474.

Van Hasselt, H., Guez, A. and Silver, D. (2016) ‘Deep reinforcement learning with

double Q-Learning’, in 30th AAAI Conference on Artificial Intelligence, AAAI 2016.

Hessel, M. et al. (2019) ‘Multi-Task Deep Reinforcement Learning with PopArt’,

Proceedings of the AAAI Conference on Artificial Intelligence. doi:

10.1609/aaai.v33i01.33013796.

Heydari, B. and Dalili, K. (2015) ‘Emergence of modularity in system of systems:

Complex networks in heterogeneous environments’, IEEE Systems Journal, 9(1), pp.

223–231. doi: 10.1109/JSYST.2013.2281694.

Heydari, B., Mosleh, M. and Dalili, K. (2016) ‘From Modular to Distributed Open

Architectures: A Unified Decision Framework’. doi: 10.1002/sys.21348.

Hofmann, E. and Rüsch, M. (2017) ‘Industry 4.0 and the current status as well as future

prospects on logistics’, Computers in Industry. Elsevier B.V., 89, pp. 23–34. doi:

10.1016/j.compind.2017.04.002.

Hou, Y. et al. (2017) ‘A novel DDPG method with prioritized experience replay’, in 2017

IEEE International Conference on Systems, Man, and Cybernetics, SMC 2017. doi:

10.1109/SMC.2017.8122622.

Konda, V. R. and Tsitsiklis, J. N. (2000) ‘Actor-critic algorithms’, in Advances in Neural

Information Processing Systems.

Kusiak, A. (2018) ‘Smart manufacturing’, International Journal of Production Research.

Taylor & Francis, 56(1–2), pp. 508–517. doi: 10.1080/00207543.2017.1351644.

Lasi, H. et al. (2014) ‘Industry 4.0’, Business and Information Systems Engineering. doi:

10.1007/s12599-014-0334-4.

Levine, S. et al. (2016) ‘End-to-end training of deep visuomotor policies’, Journal of

Machine Learning Research.

Lillicrap, T. P. et al. (2016) ‘Continuous control with deep reinforcement learning’, in 4th

International Conference on Learning Representations, ICLR 2016 - Conference Track

Proceedings.

Lillicrap, T. P. et al. (2019) ‘Continuous control with deep reinforcement learning’,

arXiv:1509.02971.

Lu, Y. (2017) ‘Industry 4.0: A survey on technologies, applications and open research

issues’, Journal of Industrial Information Integration. Elsevier Inc., 6, pp. 1–10. doi:

10.1016/j.jii.2017.04.005.

Mishra, N. et al. (2018) ‘A simple neural attentive meta-learner’, 6th International

Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings,

pp. 1–17.

Mnih, V. et al. (2015) ‘Human-level control through deep reinforcement learning’,

Nature. Nature Publishing Group, 518(7540), pp. 529–533. doi: 10.1038/nature14236.

Mnih, V. et al. (2016) ‘Asynchronous methods for deep reinforcement learning’, in 33rd

International Conference on Machine Learning, ICML 2016.

Monostori, L. et al. (2016) ‘Cyber-physical systems in manufacturing’, CIRP Annals -

Manufacturing Technology, 65(2), pp. 621–641. doi: 10.1016/j.cirp.2016.06.005.

Moore, W. L., Louviere, J. J. and Verma, R. (1999) ‘Using Conjoint Analysis to Help

Design Product Platforms’, Journal of Product Innovation Management. doi:

10.1111/1540-5885.1610027.

Mosleh, M. and Heydari, B. (2017) ‘Fair topologies: Community structures and network

hubs drive emergence of fairness norms’, Scientific Reports. doi: 10.1038/s41598-017-

01876-0.

Alet, F. et al. (2019) ‘Neural relational inference with fast modular meta-learning’, in Advances in Neural Information Processing Systems.

Nolfi, S. (1997) ‘Using Emergent Modularity to Develop Control Systems for Mobile

Robots’, Adaptive Behavior. Sage Publications, Thousand Oaks, CA, 5(3–4), pp.

343–363. doi: 10.1177/105971239700500306.

OpenAI et al. (2019) ‘Dota 2 with Large Scale Deep Reinforcement Learning’.

Peter, H. W. and Chomsky, N. (1968) ‘Aspects of the Theory of Syntax’, The Modern

Language Review. doi: 10.2307/3722650.

Pritzel, A. et al. (2017) ‘Neural episodic control’, 34th International Conference on

Machine Learning, ICML 2017, 6, pp. 4320–4331.

Ramsundar, B. et al. (2015) ‘Massively Multitask Networks for Drug Discovery’, (Icml).

Ritter, S. et al. (2018) ‘Been there, done that: Meta-learning with episodic recall’, 35th

International Conference on Machine Learning, ICML 2018, 10(1), pp. 6929–6938.

Russell, S. and Zimdars, A. L. (2003) ‘Q-Decomposition for Reinforcement Learning

Agents’, in Proceedings, Twentieth International Conference on Machine Learning.

Schaul, T. et al. (2016) ‘Prioritized experience replay’, in 4th International Conference

on Learning Representations, ICLR 2016 - Conference Track Proceedings.

Schulman, J. et al. (2017) ‘Proximal Policy Optimization Algorithms’, pp. 1–12.

Silver, D. et al. (2014) ‘Deterministic policy gradient algorithms’, in 31st International

Conference on Machine Learning, ICML 2014.

Silver, D. et al. (2016) ‘Mastering the game of Go with deep neural networks and tree

search’, Nature. doi: 10.1038/nature16961.

Silver, D. et al. (2017) ‘Mastering the game of Go without human knowledge’, Nature.

doi: 10.1038/nature24270.

Simon, H. A. (1991) ‘The Architecture of Complexity’, in Facets of Systems Science.

Boston, MA: Springer US, pp. 457–476. doi: 10.1007/978-1-4899-0718-9_31.

Simpkins, C. and Isbell, C. (2019) ‘Composable Modular Reinforcement Learning’,

Proceedings of the AAAI Conference on Artificial Intelligence. doi:

10.1609/aaai.v33i01.33014975.

Sprague, N. and Ballard, D. (2003) ‘Multiple-goal reinforcement learning with modular

Sarsa(0)’, in IJCAI International Joint Conference on Artificial Intelligence.

Sullivan, K. J. et al. (2001) ‘The structure and value of modularity in software design’, in

Proceedings of the 8th European software engineering conference held jointly with 9th

ACM SIGSOFT international symposium on Foundations of software engineering -

ESEC/FSE-9. New York, New York, USA: ACM Press, p. 99. doi:

10.1145/503209.503224.

Sutton, R. S. et al. (2000) ‘Policy gradient methods for reinforcement learning with

function approximation’, in Advances in Neural Information Processing Systems.

Tamar, A. et al. (2016) Learning from the Hindsight Plan -- Episodic MPC Improvement.

Taylor, M. E. and Stone, P. (2009) ‘Transfer Learning for Reinforcement Learning

Domains: A Survey’, Journal of Machine Learning Research, 10, pp. 1633–1685.

Teh, Y. W. et al. (2017) ‘Distral: Robust multitask reinforcement learning’, in Advances

in Neural Information Processing Systems.

Todorov, E., Erez, T. and Tassa, Y. (2012) ‘MuJoCo: A physics engine for model-based

control’, in IEEE International Conference on Intelligent Robots and Systems. doi:

10.1109/IROS.2012.6386109.

Uchibe, E., Asada, M. and Hosoda, K. (1996) ‘Behavior coordination for a mobile robot

using modular reinforcement learning’, in IEEE International Conference on Intelligent

Robots and Systems. doi: 10.1109/iros.1996.568989.

Uhlenbeck, G. E. and Ornstein, L. S. (1930) ‘On the theory of the Brownian motion’,

Physical Review. doi: 10.1103/PhysRev.36.823.

Vecerik, M. et al. (2019) ‘A practical approach to insertion with variable socket position

using deep reinforcement learning’, in Proceedings - IEEE International Conference on

Robotics and Automation. doi: 10.1109/ICRA.2019.8794074.

Vinyals, O. et al. (2019) ‘Grandmaster level in StarCraft II using multi-agent

reinforcement learning’, Nature. doi: 10.1038/s41586-019-1724-z.

Wang, J. X. et al. (2016) ‘Learning to reinforcement learn’, pp. 1–17.

Wang, W. et al. (2019) ‘Facilitating Human-Robot Collaborative Tasks by Teaching-

Learning-Collaboration from Human Demonstrations’, IEEE Transactions on

Automation Science and Engineering. doi: 10.1109/TASE.2018.2840345.

Wang, Z. et al. (2016) ‘Dueling Network Architectures for Deep Reinforcement

Learning’, in 33rd International Conference on Machine Learning, ICML 2016.

Weiss, K., Khoshgoftaar, T. M. and Wang, D. D. (2016) ‘A survey of transfer learning’,

Journal of Big Data. doi: 10.1186/s40537-016-0043-6.

Wixson, L. E. (1991) ‘Scaling Reinforcement Learning Techniques via Modularity’, in

Machine Learning Proceedings 1991. doi: 10.1016/b978-1-55860-200-7.50076-3.

Yu, T. et al. (2019) ‘Meta-World: A Benchmark and Evaluation for Multi-Task and Meta

Reinforcement Learning’, (CoRL).

Zhang, Y. and Yang, Q. (2017) ‘A Survey on Multi-Task Learning’, pp. 1–20.

Zhong, R. Y. et al. (2017) ‘Intelligent Manufacturing in the Context of Industry 4.0: A

Review’, Engineering, 3(5), pp. 616–630. doi: 10.1016/J.ENG.2017.05.015.

RightHand Robotics (2018) ‘Understanding the 3Rs of Robotic Piece-Picking’. URL:

https://www.mmh.com/article/understanding_the_3rs_of_robotic_piece_picking.
