A MODULAR REINFORCEMENT LEARNING METHOD FOR
ADAPTABLE ROBOTIC ARMS
A Thesis
Presented By
Qiliang Chen
to
The Department of Industrial Engineering
in partial fulfillment of the requirements for the degree of
Master of
Science in
the field of
Data Analytics Engineering
Northeastern University
Boston, Massachusetts
May 2020
ABSTRACT
The vision of Industry 4.0 is to materialize the notion of a lot size of one through enhanced
adaptability of manufacturing and logistics operations to dynamic changes or deviations
on shop floors. Currently, almost all industrial robots can only perform rote and repetitive
tasks in highly structured environments. Recent advances in meta reinforcement learning
and multi-task learning have the potential to enable robots to adapt to a series
of highly interrelated tasks by leveraging prior knowledge, which increases sample
efficiency. However, these methods assume a high degree of similarity within the task set,
an assumption that is difficult to satisfy in real-life problems. Motivated by this
gap, this thesis develops a modular reinforcement learning framework to enhance the
efficient transfer of control policies from previously learned tasks. The proposed
framework contains three steps: modularization, modular reinforcement learning, and
modular composition. Experiments on the OpenAI Gym Robotics environments Reach,
Push, and Pick-and-Place indicate an average 75% reduction in the number of iterations
needed to achieve a 60% success rate, compared to the deep deterministic policy gradient
(DDPG) algorithm as a baseline. The significant improvements in the jumpstart and asymptotic
performance of the agent create a promising opportunity to overcome the current limitations
of industrial robots associated with sample inefficiency and narrow task range through task
modularization and transfer learning.
ACKNOWLEDGMENTS
Foremost, I would like to express my sincere gratitude to my advisor Prof. Mohsen
Moghaddam for the continuous support of my Master’s study and research, for his patience,
motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the
research and the writing of this thesis. I could not have imagined having a better advisor and
mentor for my Master’s study.
Besides my advisor, I would like to thank the rest of my thesis committee: Prof.
Babak Heydari, for his encouragement, insightful comments, and hard questions.
Last but not least, I would like to thank my family: my parents Weixin Chen
and Xiangmei Zeng, for giving birth to me in the first place and supporting me spiritually
throughout my life.
TABLE OF CONTENTS
1. ABSTRACT
2. Introduction
2.1 Background
2.2 Problem setting and framework overview
3. Related work
3.1 Reinforcement Learning
3.2 Meta Learning
3.3 Multi-task Learning
3.4 Modularity
4. Framework
4.1 Modularization and module composition
4.2 Modular DDPG Reinforcement learning
5. Experiment and results
5.1 Experiment design
5.2 Experiment results
5.3 Discussion
6. Conclusion
7. REFERENCE
LIST OF FIGURES
Figure 1: Overview of the proposed framework
Figure 2: Algorithm 1 -- Modular-DDPG
Figure 3: Experiment setup and task modularization on the OpenAI Gym environments:
(a) Reach a randomly chosen position in 3D space. (b) Push an object to a randomly chosen
position in 2D space (tabletop). (c) Pick-and-Place to a randomly chosen position in 3D
space.
Figure 4: Preliminary results on Scenario 1
Figure 5: Preliminary results on Scenario 2
2. Introduction
2.1 Background
Smart manufacturing is an emerging form of production that integrates the manufacturing
assets of today and tomorrow with sensors, computing platforms, communication
technology, control, simulation, data-intensive modelling, and predictive engineering
(Kusiak, 2018). One of its most important components is the advanced robot, owing to
its versatility and extensive use in a wide range of applications such as assembly, welding,
painting, packaging, labeling, and inspection, among others. Moreover, inspired by Industry
4.0’s goal of building smart factories (Monostori et al., 2016) (Lasi et al., 2014)
(Hofmann and Rüsch, 2017) (Lu, 2017) (Zhong et al., 2017), robots need the ability
to solve problems and make decisions independently of people, as well as the
flexibility to learn new skills quickly and efficiently (Wang et al., 2019). However,
almost all real-world industrial robots in manufacturing plants can currently
perform only repetitive tasks in strictly structured environments, which does not meet
the requirements of real-life problems. For example, in a piece-picking process, robots
must pick items and place them at specific target locations in order, in an
unstructured environment.
The rapid development of reinforcement learning (RL) helps robots learn
some skills with little human intervention. For example, an industrial robot can learn how
to insert a peg into a hole or pick and place an object at a target location through RL with
a visual sensor. The problem is that RL typically teaches a robot a single skill; when the
robot faces a very similar new task, such as a change in object shape, rotation, or
background, the previously learned policy performs poorly. Humans, by contrast, have a
strong ability to use prior knowledge to adapt to a similar new task. For example, people
who have learned how to ride a bicycle can master riding a motorcycle more easily and quickly.
This significant gap in how robots transfer prior knowledge to master a
similar new task has attracted the interest of many researchers and institutions, such as
DeepMind (J. X. Wang et al., 2016) (Gupta et al., 2018) (Ritter et al., 2018) (Botvinick et
al., 2019), OpenAI and Siemens (Levine et al., 2016) (Duan et al., 2016) (Tamar et al.,
2016) (Vinyals et al., 2019). Research on improving the adaptability of robots
is therefore meaningful and is a key step toward the smart factory.
Compared to humans, deep reinforcement learning (DRL) still has deficiencies in two
respects (J. X. Wang et al., 2016). First, DRL is sample-inefficient, whereas humans can
attain reasonable performance on a wide range of tasks with comparatively little experience.
Second, DRL can only specialize in a narrow range of tasks, whereas humans can flexibly
adapt to changing tasks. Much recent research focuses on the first challenge. Meta-learning
is an emerging topic in artificial intelligence. Its key idea is learning to learn (J. X. Wang
et al., 2016): the algorithm aims to derive a method for learning a new
task from experience with a set of interrelated tasks, so that it can learn a new task within
a few epochs and from few examples. Meta-learning enhanced with
episodic memory can consult stored experience to make quicker and better estimates
of state values or decisions when facing a similar state. Multi-task learning (MTL) aims to
let the agent learn several skills together, leveraging useful information contained
in multiple related but not identical tasks to improve the generalization performance
of all the tasks (Zhang and Yang, 2017). An extensive review of reinforcement learning,
meta-learning with episodic memory, and multi-task learning is provided in Section
3.
Although meta-learning shows promising results on the first challenge of
RL (sample inefficiency), it relies on a strict assumption about the degree of similarity
within the task set. As pointed out by (Vinyals et al., 2019), current meta-RL methods are
still limited to very narrow task distributions that allow only slight parametric differences
between individual tasks. Humans facing unfamiliar tasks can still transfer some knowledge
from previous experience. For example, people who often play poker can learn a new poker
game faster than people with no experience, because they can transfer knowledge about the
cards themselves, such as the meaning of each card or skills for
estimating opponents’ cards, so what they need to learn is mainly the new rules.
Inspired by this phenomenon, we take a high-level combinatorial generalization approach
based on the notion of modularity (Devin et al., 2017) (Alet, Lozano-Pérez and Kaelbling,
2018) to address the second limitation of deep RL. As argued by Herbert Simon (Simon,
1991), modularity is the ubiquitous adaptability mechanism through which most biological,
social, economic, and software systems manage complexity in highly unstructured
environments (Nolfi, 1997) (Baldwin and Clark, 2000) (Sullivan et al., 2001) (Gianetto
and Heydari, 2015) (Heydari and Dalili, 2015). The implication of modular design in the
context of deep RL is simple: although learning a single policy that performs optimally
across all tasks may not be feasible (Devin et al., 2017), learning and mixing-and-matching
simpler policies for sufficiently small and overlapping task modules may solve the second
aforementioned problem of current deep RL methods (i.e., specialization in a narrow range
of tasks).
2.2 Problem setting and framework overview
In this thesis, we propose a modular RL framework, based on the notion of task
modularity and the transfer of knowledge between similar modules, to solve diverse
robotic tasks in manufacturing and logistics. Our goal is to tackle the following
fundamental research question: How can a robot autonomously adapt to complex and
unstructured manufacturing and logistics environments (e.g., assembly or piece-picking)
through automated transfer of its learned knowledge across parametrically and
non-parametrically different tasks, using the notion of modularity? The framework
comprises three stages (see Figure 1):
1) Modularization: Deciding on the degree and architecture of modularity is one of the
key decisions to make, governed by a range of parameters that characterize the
degree of spatial variations (e.g., distance between different tasks or environments)
as well as temporal uncertainties (e.g., unexpected obstacles on the way, or dynamic
environments) (Heydari, Mosleh and Dalili, 2016). The objective of this stage is to
decompose tasks into reusable modules for combinatorial generalization to a wide
range of new tasks.
2) Modular training: The goal of this stage is to train a separate function for
each module produced by the first stage, which means that several independent neural
networks need to be created. During training, if a similar pre-trained module
is detected, we transfer its parameters to the new module; otherwise, we
initialize the new module randomly and train it from scratch with an RL algorithm. The
network parameters of each new module are recorded for future use. The idea is to reuse
pre-trained modules to speed up the training of new modules in
the future (Alet, Lozano-Pérez and Kaelbling, 2018).
3) Composition: This stage is built on the assumption that there exists a compositional
scheme (Alet, Lozano-Pérez and Kaelbling, 2018) that enables the formation of any new
task from a set of modules. Thus, the goal of this stage is to learn the
relationships between modules and, based on them, to connect all modules together
so that they solve the whole task.
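The transfer-or-initialize decision in the modular training stage can be sketched as follows. This is only an illustrative sketch, not the implementation used in this thesis: the class name ModuleRepository, the cosine-similarity index, the 0.9 threshold, and the use of flat parameter lists in place of neural network weights are all assumptions made for the example.

```python
import math
import random


def cosine_similarity(a, b):
    """Assumed similarity index between two task descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class ModuleRepository:
    """Hypothetical store of trained module parameters, keyed by module name."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed similarity cut-off for reuse
        self.modules = {}           # name -> (descriptor, parameters)

    def record(self, name, descriptor, params):
        """Save a trained module's parameters for future reuse."""
        self.modules[name] = (list(descriptor), list(params))

    def initialize(self, descriptor, n_params, rng):
        """Return (params, source): transferred if a similar module exists,
        otherwise randomly initialized with source None."""
        best_name, best_score = None, 0.0
        for name, (stored, _) in self.modules.items():
            score = cosine_similarity(descriptor, stored)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= self.threshold:
            return list(self.modules[best_name][1]), best_name
        return [rng.gauss(0.0, 0.1) for _ in range(n_params)], None


rng = random.Random(0)
repo = ModuleRepository()
repo.record("reach", [1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])

# A descriptor close to "reach" reuses the stored parameters.
params, source = repo.initialize([0.95, 0.05, 0.0], n_params=4, rng=rng)
# An unrelated descriptor falls back to random initialization.
new_params, new_source = repo.initialize([0.0, 0.0, 1.0], n_params=4, rng=rng)
```

In either case the module would then be trained with an RL algorithm and written back with record, so that later tasks can reuse it.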
The proposed framework has been validated on the Robotics environments of
OpenAI Gym (Brockman et al., 2016), with deep deterministic policy gradient (DDPG)
(Lillicrap et al., 2019) serving as the deep RL model. Section 4 provides the details of the
framework, and Section 5 presents the experiments and results. Section 6 concludes the
thesis along with a summary of limitations and proposed directions for future research.
[Figure 1: Overview of the proposed framework. Components shown: Task Modularization;
Module Mapping & Transfer Learning; Modular Reinforcement Learning; Task Module
Repository; Validation on OpenAI Gym Robotics Environments.]
3. Related work
3.1 Reinforcement Learning
Recently, reinforcement learning (RL) has achieved remarkable success in several
different areas. Deep reinforcement learning (DRL) first attracted widespread
attention when an agent trained with a DRL algorithm learned to
play Atari games at a super-human level of performance (Mnih et al., 2015). A
series of significant results in reinforcement learning followed. Go has long been regarded
as one of the most challenging games for artificial intelligence. In 2016, the agent
AlphaGo, developed by DeepMind, defeated the legendary Go player Lee Sedol
4-1 using a DRL algorithm (Silver et al., 2016), and the updated agent AlphaGo
Zero achieved superhuman performance, winning 100–0 against the champion-defeating
AlphaGo (Silver et al., 2017). StarCraft II is a real-time strategy video game with a highly
complex environment and a diverse set of possible decisions. In 2019, the agent AlphaStar
from DeepMind used a multi-agent reinforcement learning algorithm to master this game and
was rated above 99.8% of officially ranked human players (Vinyals et al., 2019). García and
Shafie, 2020 used a safe RL algorithm to teach a humanoid robot to walk faster.
3.1.1 Value-based and Policy-based Reinforcement Learning
There are two types of reinforcement learning algorithms: model-free and model-based.
The difference is that model-based algorithms first learn a
representation of the environment and then plan a solution in it, while
model-free algorithms do not. Model-free RL algorithms have recently been studied
extensively. One of the most basic yet popular and powerful value-based model-free RL
algorithms is Q-learning (Mnih et al., 2015). It estimates the Q-value function,
which represents how good it is to take an action in a specific state, and the policy
chooses the action that maximizes the Q value. However, the original Q-learning
algorithm tends to overestimate Q values, which hurts performance.
Instead of using the same values to select and to evaluate an action, Double Q-learning
decouples selection from evaluation by using two Q-value functions (Van Hasselt, Guez and
Silver, 2016). Dueling DQN splits the basic Q-learning architecture into two
streams, a state-value function and per-action advantages, and combines them
to obtain the final Q-value function, which makes it clearer whether reward comes mainly
from states or from actions (Z. Wang et al., 2016). Prioritized experience replay is a
strategy that increases the probability of sampling from the replay buffer those transitions
that would cause a larger change when the parameters are updated, i.e., those with a larger
temporal-difference error (Schaul et al., 2016). It is a general strategy that can be applied
to different RL algorithms. There are also policy-based RL methods, which update the
policy using the gradient of the expected reward with respect to the policy parameters
(Sutton et al., 2000).
The value-based methods above are critic-only, and the policy-based methods are actor-only.
The actor-critic architecture aims to combine the strengths of both: the critic
approximates a value function, which is then used to update the actor’s policy parameters
for performance improvement (Konda and Tsitsiklis, 2000). However, one of the biggest
drawbacks of the original actor-critic is high variance; learning an advantage function (the
action value minus a baseline value) reduces it, which is known as the advantage actor-critic
method (Mnih et al., 2016). To make more efficient use of data collected from interaction
with the environment, importance sampling allows past experience to be reused to update
the policy (Degris, White and Sutton, 2012). The proximal policy optimization (PPO) algorithm
adds a constraint on the difference between the old policy and the new policy when using
importance sampling to learn from previous experience, which makes the learning process
much more stable (Schulman et al., 2017).
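To make the difference between the standard and the Double Q-learning targets concrete, the following minimal sketch computes both for a single transition. The function names and the numerical values are illustrative assumptions; action values are plain lists indexed by action rather than network outputs.

```python
def q_learning_target(reward, next_q, gamma=0.99, done=False):
    """Standard target: the same values select and evaluate the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q)


def double_q_target(reward, next_q_select, next_q_eval, gamma=0.99, done=False):
    """Double Q-learning target: one value function selects the action,
    the other evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_select)), key=lambda a: next_q_select[a])
    return reward + gamma * next_q_eval[best_action]


# Two noisy estimates of the same next-state action values (illustrative numbers).
q_a = [1.0, 2.5, 0.5]  # estimate used for action selection
q_b = [1.2, 1.5, 0.4]  # second estimate used for evaluation

single = q_learning_target(0.0, q_a, gamma=0.9)
double = double_q_target(0.0, q_a, q_b, gamma=0.9)
```

Here the standard target takes the maximum of the first estimate directly, while the double target evaluates the selected action with the second estimate, so an action that looks good only because of noise in one estimate inflates the target less.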
3.1.2 Continuous Action Space Reinforcement Learning
In the problem settings above, the action space is discrete. In Atari
games, for example, the actions may be “up”, “down”, “left”, “right”, and “fire”, so the
output of the policy is a probability distribution over actions given a state, and the
greedy policy simply chooses the action with the largest probability. However, many real
tasks have a continuous action space. In piece picking, as in our case, the action is the
angle or velocity of the joints of a robotic arm. Deterministic Policy Gradient (DPG) uses
an actor to map states to deterministic actions rather than to a probability distribution,
so it can be applied to tasks with continuous action spaces (Silver et al., 2014). Deep
Deterministic Policy Gradient (DDPG) is trained off-policy with samples from a replay
buffer. It also uses a target Q network to give consistent targets during
temporal-difference backups (Lillicrap et al., 2019).
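The role of the target networks in DDPG can be illustrated with a small sketch of the temporal-difference target and of the soft target update described by Lillicrap et al. The toy actor and critic below are hypothetical stand-ins for neural networks, and the values of gamma and tau are arbitrary choices for the example.

```python
def td_target(reward, next_state, done, target_actor, target_critic, gamma=0.98):
    """y = r + gamma * Q'(s', mu'(s')) for non-terminal transitions."""
    if done:
        return reward
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)


def soft_update(target_params, online_params, tau=0.05):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Toy stand-ins for the target networks (hypothetical, for illustration only).
def toy_target_actor(state):
    return -0.5 * state  # deterministic action for a scalar state


def toy_target_critic(state, action):
    return state + action  # scalar Q-value estimate


y = td_target(reward=-1.0, next_state=2.0, done=False,
              target_actor=toy_target_actor, target_critic=toy_target_critic,
              gamma=0.5)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

Because the target parameters change only slowly, the value y that the critic regresses toward stays nearly fixed between updates, which is what gives the consistent targets mentioned above.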
In our case, the reward signals are sparse and binary, which means the agent may keep
interacting with the environment without receiving any positive reward and thus learn
nothing. One common way to address this problem is reward shaping, in which human prior
knowledge is used to guide the agent toward the final goal step by step. For example, the
agent developed by OpenAI learned to play Dota 2 with guidance such as a negative reward
when a character dies or a positive reward when resources are collected, while the final
goal is winning the game (OpenAI et al., 2019). However, in some tasks it is very hard to
design a good reward shaping strategy. Hindsight Experience Replay (HER) can be described
as learning from failure: it creates substitute goals in each trial, so even if the agent does
not reach the real goal, it still receives some reward and learns something from every
experience (Andrychowicz et al., 2017). DDPG + HER performs well on the OpenAI
Gym Robotics environments, so in our experiments we use this combination as a
benchmark to evaluate the performance of our method.
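The goal-relabeling idea behind HER can be sketched as below. This is a simplified illustration under stated assumptions: each transition is a hypothetical tuple (state, action, achieved goal, desired goal), the reward is 0 when the achieved goal lies within a tolerance of the goal and -1 otherwise, and substitute goals are drawn from goals achieved later in the same episode (the "future" strategy of Andrychowicz et al.).

```python
import random


def sparse_reward(achieved_goal, goal, tol=0.05):
    """Binary sparse reward: 0 if the goal is reached, -1 otherwise."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved_goal, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def her_relabel(episode, k=2, rng=random):
    """Return the original transitions plus k copies of each whose goal is
    replaced by a goal achieved at the same or a later step of the episode."""
    out = []
    for t, (state, action, achieved, goal) in enumerate(episode):
        out.append((state, action, goal, sparse_reward(achieved, goal)))
        for _ in range(k):
            future = rng.randrange(t, len(episode))
            new_goal = episode[future][2]  # achieved goal at a later step
            out.append((state, action, new_goal, sparse_reward(achieved, new_goal)))
    return out


# A failed 1-D episode: the desired goal 1.0 is never reached.
episode = [
    ((0.0,), (0.1,), (0.1,), (1.0,)),
    ((0.1,), (0.1,), (0.2,), (1.0,)),
    ((0.2,), (0.1,), (0.3,), (1.0,)),
]
replay = her_relabel(episode, k=2, rng=random.Random(0))
```

Every original transition in this episode has reward -1, but the relabeled copies of the last step have reward 0, so the replay buffer contains successful examples even though the real goal was never reached.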
3.2 Meta Learning
Although reinforcement learning has achieved super-human performance
on single tasks, DRL still has deficiencies in two respects compared to humans (J. X. Wang
et al., 2016). First, DRL is sample-inefficient, whereas humans can attain reasonable
performance on a wide range of tasks with comparatively little experience. Second, DRL can
only specialize in a narrow range of tasks, whereas humans can flexibly adapt to changing
tasks. Much recent research focuses on the first challenge. The
slowness of current reinforcement learning algorithms stems from two sources: weak
inductive bias and incremental parameter adjustment (Botvinick et al., 2019). Accordingly,
two kinds of methods address this problem: meta-RL and episodic RL.
Meta-learning, known as learning to learn, is a concept that originated in psychology
(Harlow, 1949), and the idea can also be applied to RL. It means that
when an agent faces a new environment or task, it can leverage the prior knowledge (inductive
bias) learned from previous tasks to master the new one faster. For example, people who
know how to ride a bicycle learn to ride a motorcycle faster than people with no previous
experience. This idea can be implemented in several ways. One way is to train a recurrent
neural network (RNN) on a series of tasks drawn from the same distribution to
maximize the total reward over all the tasks. Since the RNN takes all previous information
into account, it can learn knowledge common across tasks, which lets it solve a new
task faster (Duan et al., 2016) (J. X. Wang et al., 2016). A simple but powerful algorithm
named MAML (Model-Agnostic Meta-Learning) tries to find initial parameters
that give the best performance after one update on each of a series of tasks (Finn,
Abbeel and Levine, 2017). Mishra et al., 2018 combined temporal convolution layers with
causal attention layers, which avoided the exponentially increasing number of layers when
using an RNN as the meta-learner. Some research focuses on using meta-RL to learn an
exploration strategy. Gupta et al., 2018 add to the input of the policy a latent state drawn
from a Gaussian distribution and use MAML to learn a good latent state, which then improves
the exploration process. Experiments on 50 robotics environments in (Yu et al.,
2019) set up a benchmark for current multi-task and meta-learning algorithms, which gives
us a good reference for evaluating our model in the future.
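As a rough formalization of the MAML idea described above, the one-step objective of Finn, Abbeel and Levine (2017) can be written as follows. The notation is introduced here for illustration only: \theta denotes the initial parameters, \alpha the inner step size, and \mathcal{L}_{\mathcal{T}_i} the loss on task \mathcal{T}_i sampled from the task distribution p(\mathcal{T}); in the RL setting the loss would be the negative expected return.

```latex
\theta^{*} = \arg\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})}
  \mathcal{L}_{\mathcal{T}_i}\!\left(\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\theta)\right)
```

The inner expression is the parameter vector after one gradient step on task \mathcal{T}_i, so the outer minimization searches for an initialization from which a single update already performs well on every task.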
Episodic RL records the situations the agent has previously encountered, including
the states and the actions taken, so that when facing a new situation the agent can look up
the most similar situation in memory and take the associated action (Pritzel et al.,
2017). In this way, the agent can act immediately without lengthy fine-tuning
of its parameters. Ritter et al., 2018 combine episodic memory with a meta-RL
architecture; the proposed model, epL2RL, substantially outperforms the model without
episodic memory in several experiments.
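The lookup step of episodic RL can be sketched with a nearest-neighbour memory. This is a deliberately simplified illustration and not the method of Pritzel et al., which estimates action values from several neighbours of a learned embedding; the class name EpisodicMemory and the use of Euclidean distance on raw states are assumptions for the example.

```python
import math


class EpisodicMemory:
    """Hypothetical nearest-neighbour memory of (state, action) records."""

    def __init__(self):
        self.records = []

    def write(self, state, action):
        """Store the action taken in a previously encountered state."""
        self.records.append((tuple(state), action))

    def recall(self, state):
        """Return the action stored with the closest state, or None if empty."""
        if not self.records:
            return None
        nearest = min(self.records, key=lambda rec: math.dist(rec[0], state))
        return nearest[1]


memory = EpisodicMemory()
memory.write((0.0, 0.0), "left")
memory.write((1.0, 1.0), "right")
action = memory.recall((0.9, 1.2))
```

The recalled action is available as soon as a record is written, which is why this kind of memory avoids the slow incremental parameter adjustment discussed above.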
Transfer learning tries to transfer prior knowledge from an old task to a new task
to help the agent master the new task faster. The ideas behind meta-RL and episodic RL are
very similar to transfer learning: they also use prior knowledge or memory to adapt
to a new task efficiently. We can therefore use metrics from transfer learning to quantify
the performance of meta-RL algorithms, and we also use some of them in our experiments to
evaluate our model. Several metrics are available: jumpstart, asymptotic
performance, total reward, transfer ratio, and time to threshold (Taylor and Stone, 2009).
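Several of these metrics can be computed directly from two learning curves, one with transfer and one without. The sketch below is illustrative: the function names, the averaging window for asymptotic performance, and the success-rate values are assumptions made for the example, not results from this thesis.

```python
def jumpstart(transfer_curve, scratch_curve):
    """Difference in initial performance between transfer and no transfer."""
    return transfer_curve[0] - scratch_curve[0]


def tail_mean(curve, window):
    """Mean of the last `window` points of a learning curve."""
    tail = curve[-window:]
    return sum(tail) / len(tail)


def asymptotic_gain(transfer_curve, scratch_curve, window=3):
    """Difference in final performance, averaged over the last points."""
    return tail_mean(transfer_curve, window) - tail_mean(scratch_curve, window)


def time_to_threshold(curve, threshold):
    """Index of the first evaluation reaching `threshold`, or None if never."""
    for i, value in enumerate(curve):
        if value >= threshold:
            return i
    return None


def total_reward(curve):
    """Area under the learning curve, taken as the sum over evaluations."""
    return sum(curve)


# Hypothetical success rates per evaluation (illustrative numbers only).
scratch = [0.0, 0.1, 0.3, 0.6, 0.8, 0.8]
transfer = [0.3, 0.6, 0.8, 0.9, 0.9, 0.9]

js = jumpstart(transfer, scratch)
gain = asymptotic_gain(transfer, scratch)
t_scratch = time_to_threshold(scratch, 0.6)
t_transfer = time_to_threshold(transfer, 0.6)
```

With these example curves, the transfer learner starts higher, ends higher, and reaches the 0.6 threshold two evaluations earlier than the learner trained from scratch.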
3.3 Multi-task Learning
Multi-task learning (MTL) learns several related tasks together,
leveraging useful shared information to improve the generalization performance of all the
tasks. For example, when a person learns to ride a bicycle and a tricycle together, the
experience gained in learning to ride the bicycle can be utilized in riding the tricycle and
vice versa (Zhang and Yang, 2017). The mechanism is very similar to transfer learning (Weiss,
Khoshgoftaar and Wang, 2016), except for the objective: MTL aims to improve
performance over all the tasks, while transfer learning focuses on improving the target
tasks. Meta-learning aims to learn a common learning method by training on a series
of related tasks, so that when facing a new task it can adapt with few samples and a short
training process. Unlike meta-learning, MTL aims to learn the tasks themselves, so a
trained MTL model can solve tasks without further adaptation. Because it
leverages knowledge shared between tasks, MTL is a good remedy for the sample
inefficiency of deep learning (Caruana, 1997), especially when labeled data is
hard to collect, as with medical data. Recent research in MTL has been successful across
many applications of machine learning, from natural language processing (Collobert and
Weston, 2008) and speech recognition (Deng, Hinton and Kingsbury, 2013) to computer vision
(Girshick, 2015) and drug discovery (Ramsundar et al., 2015). Since our focus is on the RL
problem setting, we mainly review achievements of MTL in solving RL problems
in detail.
One of the promised benefits of MTL is that shared parameters can improve
performance across all the related tasks. In practice, however, knowledge shared
between different tasks can interfere negatively, which leads to unstable and inefficient
training. (Teh et al., 2017) proposed an approach named “Distral” for joint training of
multiple tasks. The main idea is that, instead of sharing parameters between different tasks
directly, all the tasks share a “distilled” policy that captures common behavior across tasks.
The agent is constrained to solve each specific task without straying too far from the
distilled policy, which makes training much more robust and efficient. A general issue in
MTL is the imbalanced distribution of learning resources among the tasks. In other words,
some tasks appear more salient to the learning process, for instance because of the density
or magnitude of the in-task rewards, which results in a loss of generality. (Hessel et al.,
2019) propose a method that automatically adapts the contribution of each task to the
agent’s updates, so that all tasks have a similar impact on the learning dynamics. The
proposed method helps a single policy master 57 diverse Atari games with super-human
performance. (Deisenroth et al., 2014) applied a policy-search method to multi-task learning
for robots with stationary dynamics. The key idea is to explicitly parametrize the policy by
the task, which enables the policy to generalize from training tasks to similar but unknown
tasks at test time. This generalization is phrased as an optimization problem, which can be
solved jointly with learning the policy parameters. Meta-World (Yu et al., 2019) reports the
performance of state-of-the-art MTL algorithms on 50 environments. The results show that in
the scenario with 10 tasks the agent reaches an 80% success rate, while in the scenario with
50 tasks the agent reaches only a 48% success rate: as the complexity of the high-level goal
(the number of tasks in MTL) increases, the performance of the MTL agent decreases
significantly. The Meta-World results for MTL algorithms are an important reference
benchmark for future research.
3.4 Modularity
Modularity has long been recognized as an effective adaptability mechanism and
has been an active area of research in a wide range of different academic disciplines, since
the influential work by Herbert Simon (Simon, 1991). Simon argues that near-
decomposability enables systems to respond effectively to external changes without
disrupting the system as a whole. Modularity has also been recognized as an essential
concept in architecting engineering products, processes, and organizations. It has been
shown to increase product and organizational variety (Eppinger and Ulrich, 2015), the rate
of technological and social innovation (Baldwin and Clark, 2000), market dominance
through interface capture (Moore, Louviere and Verma, 1999), and cooperation and trust in
networked systems (Gianetto and Heydari, 2015) (Mosleh and Heydari, 2017).
The ubiquity of modularity in various complex systems as an adaptability
mechanism makes it a suitable candidate for incorporating it within RL algorithms to
achieve the adaptability that can enable efficient transfer learning across different tasks
with both parametric and non-parametric variations. This, in fact, has been a goal for some
AI researchers for decades. Leveraging modularity to scale up RL capabilities dates back
to the early 90s (Wixson, 1991) (Uchibe, Asada and Hosoda, 1996), following and enabled
by the development of the Q-learning algorithm. However, modularity has been used with
different meanings and approaches in AI since then, and it is important to distinguish
between the different ways in which modularity has been used in meta-learning. With some
simplification, previous work in modular meta-learning can be divided into two general
categories, based on the overall approach to modularity. Here we briefly discuss
these two approaches.
The first approach, dominant in the pre-deep RL era, is based on hierarchical
learning (Barto and Mahadevan, 2003) in which most frameworks consist of two separate
mechanisms for task decomposition (Singh, 1992) followed by behavior coordination
(Uchibe, Asada and Hosoda, 1996). In this approach, different modules act as different
action voters. That is, each module observes the action taken by the agent, the state
transition, and a reward signal specific to the module. At each time step, the agent combines
the action preferences of the modules to compute a joint policy (Russell and Zimdars, 2003)
(Sprague and Ballard, 2003) (Simpkins and Isbell, 2019). More recently, Frans et al., 2018
applied hierarchical RL using deep neural networks by pre-training a pool of common
neural network sub-policies across tasks and then using a task-specific master-policy
neural network to select the appropriate sub-policy.
Unlike the first approach which considers modules as distinct (goal-specific) policy
agents whose action recommendations need to be aggregated for each task, the second
approach directly applies modularity to the deep neural network architecture. Modules are
reusable neural network functions that are pre-trained and can then be recombined in
different ways to undertake new tasks. The basic idea in this approach is that rather than
training a single network on a large amount of training data, one can simultaneously train
a large number of different networks while tying their parameters together, which in the
end generates a set of reusable neural network modules. This idea has recently been applied
to a variety of applications such as real-world reasoning problems (Andreas et al., 2016),
task and motion planning (Chitnis, Kaelbling and Lozano-Perez, 2019), and robotics
(Devin et al., 2017). In the latter, the authors show that neural network policies can be
decomposed into task-specific and robot-specific modules, where the former are shared
across robots and the latter are shared across tasks. Our work builds on Alet, Lozano-Pérez
and Kaelbling, 2018, in which the authors use a set of neural network modules for a set of
basic functions that can be re-tuned and recombined using adaptive structures in the face
of new tasks. Alet, Lozano-Pérez and Kaelbling, 2018, however, formulate and apply this
method to a set of supervised robotic problems and do not extend it to RL.
4. Framework
As discussed in the background section, deep RL has recently achieved remarkable
success in reaching human-level performance in tasks with discrete and low-dimensional
action spaces, such as playing Atari via the DQN algorithm (Mnih et al., 2015). Several
algorithms have been introduced in recent years for dealing with continuous,
high-dimensional action spaces, among which the DDPG algorithm (Lillicrap et al., 2016) has
demonstrated great success in accomplishing relatively simple tasks such as pendulum,
cartpole swing-up, or puck shooting. For more complex tasks with continuous and
high-dimensional action spaces, such as robotic assembly or piece-picking, however, current
off-policy RL algorithms such as DDPG may not be directly applicable (see, e.g.,
Andrychowicz et al., 2017; Vecerik et al., 2019) due to the inherent complexity of these
tasks. We tackle this problem through task modularization and module composition in order
to enable the transfer of policies across different tasks with non-parametric variations
(Vinyals et al., 2019). Without loss of generality, we implement this notion through a
modular DDPG (M-DDPG) algorithm and test it on three OpenAI Gym Robotics environments.
Details of the proposed framework are presented next.
4.1 Modularization and Module Composition
Deciding on the degree and architecture of modularity is one of the key decisions to
make, governed by a range of parameters that characterize the degree of spatial variations
(e.g., distance between different tasks or environments) as well as temporal uncertainties
(e.g., unexpected obstacles on the way, or dynamic environments) (Heydari, Mosleh and
Dalili, 2016). In this context, the objective of modularization is to decompose tasks
into reusable modules for combinatorial generalization to a wide range of new tasks. We
present a formalism of the modularization process here and leave the development of a
methodology for automated identification of task modules for future research. The
underlying assumption of task modularization is that there exists a composition function
(Alet, Lozano-Pérez and Kaelbling, 2018) for forming any new task from a set of modules.
Once the composition function has been learned, a new task can be decomposed into
several modules according to the learned relationships among them. Assuming that each
module can be trained with the proposed M-DDPG algorithm, the pre-trained modules can
then be composed into a single function that solves the new task. Thus, the key idea here
is to learn the composition function.
Graph neural networks (GNNs) naturally support combinatorial generalization because
they do not perform computations strictly at the system level, but also apply shared
computations across the entities and across the relations (Battaglia et al., 2018). The idea
behind GNNs is the “infinite use of finite means” (Peter and Chomsky, 1968): re-using the
same components in different compositional structures can produce a variety of effects.
We therefore apply a GNN to learn the composition function, following the structure used
in Alet et al. (2019). A graph is defined as a 3-tuple $G = (u, V, E)$, where $u$ is a global
attribute, $V = \{v_i\}_{i=1:N_v}$ is the set of nodes (of cardinality $N_v$), and
$E = \{(e_k, r_k, s_k)\}_{k=1:N_e}$ is the set of edges (of cardinality $N_e$), where each
$e_k$ is the edge's attribute, $r_k$ is the index of the receiver node, and $s_k$ is the
index of the sender node (Alet et al., 2019). In our case, each node represents a module
decomposed from the task, and each edge between two nodes represents the relationship
between the two corresponding modules. The graph is directed, so swapping the sender and
receiver of an edge yields a different graph.
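As a rough illustration of this graph definition, the sketch below stores the 3-tuple in a small Python data class. The class name, the `add_edge` and `successors` helpers, and the example attributes are hypothetical and only mirror the notation of the text; this is not the GNN itself.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class ModuleGraph:
    """Directed graph G = (u, V, E) over task modules."""
    u: Any                                   # global attribute
    V: List[str]                             # node attributes, one per module
    E: List[Tuple[Any, int, int]] = field(default_factory=list)  # (e_k, r_k, s_k)

    def add_edge(self, attr: Any, sender: int, receiver: int) -> None:
        # Stored in the (e_k, r_k, s_k) order used in the text.
        self.E.append((attr, receiver, sender))

    def successors(self, node: int) -> List[int]:
        """Indices of the nodes that receive an edge sent by `node`."""
        return [r for (_, r, s) in self.E if s == node]


# Two modules of a Push-like task with one directed edge between them.
g = ModuleGraph(u="Push", V=["reach", "push_after_reach"])
g.add_edge("then", sender=0, receiver=1)
```

Because edges are directed, `g.successors(0)` contains node 1 while `g.successors(1)` is empty.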
4.2 Modular DDPG Reinforcement Learning
The GNN helps us carry out the modularization and composition steps efficiently. It
also allows us to re-use pre-trained modules to initialize new ones, which gives the agent
a jumpstart when facing a new task. Thus, when using the proposed M-DDPG to train a
new module, we first check whether a similar module exists in the module repository and,
if so, transfer its prior knowledge to the new module; if not, we initialize the new module
randomly and train it from scratch. The training process itself is handled by a novel
algorithm that we propose, named Modular DDPG (M-DDPG).
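The repository check can be sketched as follows. This is a minimal illustration that assumes similarity is decided by an exact name match and that a module's parameters fit in a flat list of floats; `ModuleRepository`, `init_module`, and the random initialization range are hypothetical names and choices, not part of the thesis implementation.

```python
import random
from typing import Dict, List, Optional, Tuple


class ModuleRepository:
    """Stores the parameters of previously trained modules by name."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def save(self, name: str, params: List[float]) -> None:
        self._store[name] = list(params)

    def lookup(self, name: str) -> Optional[List[float]]:
        params = self._store.get(name)
        return None if params is None else list(params)  # return a copy


def init_module(repo: ModuleRepository, source_name: str,
                n_params: int, seed: int = 0) -> Tuple[List[float], bool]:
    """Return (initial parameters, whether they were transferred)."""
    prior = repo.lookup(source_name)
    if prior is not None:
        return prior, True                    # warm start from the repository
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n_params)], False


repo = ModuleRepository()
repo.save("reach", [0.1, 0.2])
warm_params, warm = init_module(repo, "reach", n_params=2)
cold_params, cold = init_module(repo, "grasp", n_params=3)
```

In both cases the returned parameters would then be handed to the training loop; only the starting point differs.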
A standard DDPG setup with fully observed environments (Hou et al., 2017) is
considered for each task module. Let $f$ and $\theta$ denote the set of neural networks and
the respective parameters representing a task module. $f$ comprises four neural networks:
the actor's policy network, the critic's Q network, the actor's target policy network,
and the critic's target Q network, with network parameters denoted by
$\theta = (\theta^{\mu}; \theta^{Q}; \theta^{\mu'}; \theta^{Q'})$, in that order. $Q(s, a)$ is
the value of the state-action pair and $\mu(s) = a$ is the deterministic policy. For each
given module of a new task, the actor and critic networks are initialized with the
parameters of the most related source module, denoted by $\theta_{*}^{\mu}$ and
$\theta_{*}^{Q}$. Note that both sets of parameters are initialized randomly if no such
pre-learned module exists. The critic's loss function is thus calculated as
$L = \frac{1}{N} \sum_i [y_i - Q(s_i, a_i | \theta^{Q})]^2$,
where $\theta_{\text{init}}^{Q} \leftarrow \theta_{*}^{Q}$,
$y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} | \theta^{\mu'}) | \theta^{Q'})$, $N$ is the size of
the mini-batch of experiences sampled from the replay buffer, and $\gamma$ is a discount
factor. The actor's policy gradient over the same mini-batch of size $N$ is calculated as
$\nabla_{\theta^{\mu}} J(\theta) = \frac{1}{N} \sum_i [\nabla_a Q(s, a | \theta^{Q})|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s | \theta^{\mu})|_{s=s_i}]$,
where $\theta_{\text{init}}^{\mu} \leftarrow \theta_{*}^{\mu}$.
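To make the critic target, the critic loss, and the actor gradient concrete, the sketch below evaluates them on a two-sample mini-batch. It is a toy illustration under strong assumptions: states and actions are scalars, the four networks are replaced by hand-written linear functions, and the derivatives are supplied analytically rather than by automatic differentiation. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# One transition (s_i, a_i, r_i, s_{i+1}); scalars keep the sketch short.
Transition = Tuple[float, float, float, float]


def td_targets(batch: Sequence[Transition],
               target_actor: Callable[[float], float],
               target_critic: Callable[[float, float], float],
               gamma: float) -> List[float]:
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return [r + gamma * target_critic(s_next, target_actor(s_next))
            for (_, _, r, s_next) in batch]


def critic_loss(batch: Sequence[Transition],
                critic: Callable[[float, float], float],
                targets: Sequence[float]) -> float:
    """L = 1/N * sum_i (y_i - Q(s_i, a_i))^2."""
    errors = [(y - critic(s, a)) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


def actor_gradient(states: Sequence[float],
                   actor: Callable[[float], float],
                   dq_da: Callable[[float, float], float],
                   dmu_dtheta: Callable[[float], float]) -> float:
    """Batch mean of dQ/da at a = mu(s_i) times dmu/dtheta at s_i."""
    terms = [dq_da(s, actor(s)) * dmu_dtheta(s) for s in states]
    return sum(terms) / len(terms)


# Toy stand-ins for the four networks (hypothetical, linear in their inputs).
actor = lambda s: 0.5 * s                  # mu(s | theta) with theta = 0.5
critic = lambda s, a: 2.0 * s + a
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a

batch = [(0.0, 1.0, -1.0, 1.0), (1.0, 0.0, 0.0, 2.0)]
targets = td_targets(batch, target_actor, target_critic, gamma=0.5)
loss = critic_loss(batch, critic, targets)
grad = actor_gradient([s for (s, _, _, _) in batch], actor,
                      dq_da=lambda s, a: 1.0,     # d(2s + a)/da
                      dmu_dtheta=lambda s: s)     # d(theta * s)/dtheta
```

In M-DDPG the only difference from this generic DDPG computation is the starting point: the actor and critic begin from the source module's parameters when one is available.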
The notion of soft updates (Mnih et al., 2015; Lillicrap et al., 2016) is applied for
updating the actor and critic target networks as
$\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1 - \tau)\theta^{\mu'}$ and
$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1 - \tau)\theta^{Q'}$, respectively ($\tau \ll 1$).
This is accompanied by on-the-fly exploration based on the Ornstein-Uhlenbeck process
(Uhlenbeck and Ornstein, 1930). The observed state ($s_i$) and reward ($r_i$), together with
the action ($a_i$), are then accumulated in a replay buffer that stores sampled experiences
as $(s_i; a_i; r_i; s_{i+1})$ in a finite-size memory for updating the neural network
parameters. In this setting, the reward is typically assigned from $\{-1, 0\}$, indicating
missing and achieving a goal $g$, respectively. The pseudocode of M-DDPG is presented in
Algorithm 1 (see Figure 3).
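The soft target update and the Ornstein-Uhlenbeck exploration noise can be sketched in a few lines. This is a minimal illustration: parameters are flat lists, the process is discretised with a simple Euler step, and the default values of `theta`, `sigma`, and `dt` as well as the clipping range are assumptions for the example, not settings reported in the thesis.

```python
import math
import random
from typing import List


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """theta' <- tau * theta + (1 - tau) * theta', element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


class OUNoise:
    """Euler-discretised Ornstein-Uhlenbeck process (temporally correlated noise)."""

    def __init__(self, mu: float = 0.0, theta: float = 0.15, sigma: float = 0.2,
                 dt: float = 1.0, x0: float = 0.0, seed: int = 0) -> None:
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = x0
        self.rng = random.Random(seed)

    def sample(self) -> float:
        drift = self.theta * (self.mu - self.x) * self.dt       # pull toward mu
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


# Target parameters move a fraction tau toward the online parameters.
target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.1)

# Exploration: add noise to a (hypothetical) policy output and clip it.
noise = OUNoise(seed=42)
noisy_action = max(-1.0, min(1.0, 0.3 + noise.sample()))
```

With a small `tau`, the target networks change slowly, which is what stabilises the bootstrapped critic target.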
When facing high-level and complex tasks such as piece picking, it is very difficult for
the agent to finish the task in the early stage of training, when the policy is still very
poor. In other words, the reward is so sparse that the agent learns nothing: no matter
what actions it takes, it only receives negative rewards, which prevents it from figuring
out which actions are better. Hindsight experience replay (HER) is a strategy to boost the
training process when the reward is sparse (Andrychowicz et al., 2017). It creates fake goals
(typically the end state after each action) for the agent in every time step. The episodes are
therefore re-examined with a number of fake goals, resulting in a replay buffer associated
with both the original and the fake goals. The replay buffer is used for sampling mini-batches
to update the actor and critic networks. Thus, the agent can always learn something, even
from mistakes. Intuitively, the fake goals act as intermediate goals that guide the
agent toward the real goal gradually, owing to the strong generalization ability of neural
networks. In this thesis, we use HER to enhance M-DDPG, which boosts the
training process and results in better performance.
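The relabeling idea can be sketched on a toy episode. The example below assumes that the achieved goal is simply the state reached, uses the sparse reward in {-1, 0}, and implements only the simplest substitution strategy (the final state of the episode becomes the fake goal); Andrychowicz et al. (2017) describe several others. All names are hypothetical.

```python
from typing import List, Tuple

State = Tuple[float, ...]
Step = Tuple[State, int, State, State]                  # (s, a, s_next, goal)
Stored = Tuple[State, int, float, State, State]         # (s, a, r, s_next, goal)


def sparse_reward(achieved: State, goal: State, tol: float = 1e-6) -> float:
    """0 when the goal is reached (within tol), -1 otherwise."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def relabel_final(episode: List[Step]) -> List[Stored]:
    """Store each step twice: with its original goal and with the final state as goal."""
    fake_goal = episode[-1][2]
    stored: List[Stored] = []
    for s, a, s_next, goal in episode:
        stored.append((s, a, sparse_reward(s_next, goal), s_next, goal))
        stored.append((s, a, sparse_reward(s_next, fake_goal), s_next, fake_goal))
    return stored


# A failed 1-D episode: the agent moves from 0 to 2 but the goal was 5.
episode = [((0.0,), 1, (1.0,), (5.0,)),
           ((1.0,), 1, (2.0,), (5.0,))]
buffer = relabel_final(episode)
rewards = [t[2] for t in buffer]
```

Every original transition earns -1, but the relabeled copy of the last step earns 0, so the buffer now contains at least one successful example.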
Figure 3: Algorithm 1-- Modular-DDPG
5. Experiment and Results
This section evaluates the performance of the proposed framework on a set of
Robotics environments in the OpenAI Gym (Brockman et al., 2016). The OpenAI Gym
provides various virtual standard benchmark environments for RL research and has
contributed significantly to recent advances in AI research, as evidenced by the high
percentage of publications that use such environments for validation and benchmarking.
The existing Robotics environments allow experiments on a set of virtual goal-oriented
tasks including reaching an object positioned randomly in 3D space, pushing/sliding an
object to a goal position in 2D space, and picking an object and placing it at a random
position in 3D space. The Gym Robotics environments are all enabled via the MuJoCo physics
simulator (Todorov, Erez and Tassa, 2012).
5.1 Experiment Design
Each module neural network is trained with DDPG (Lillicrap et al., 2016) and HER
(Andrychowicz et al., 2017) to handle sparse rewards (see Algorithm 1). All four deep neural
networks that we use (i.e., training actor network, target actor network, training critic
network, and target critic network) comprise three fully-connected hidden layers with 256
neurons in each layer. Each training epoch comprises 50 cycles, each with two
rollouts. In each rollout, the agent interacts with the environment for 50 steps. The network
parameters are updated 40 times after the 50 interaction steps of each rollout. Other
hyper-parameter settings are similar to those in Andrychowicz et al. (2017). For training
using M-DDPG, we first pre-train the networks on a set of tasks, and then utilize the
pre-trained networks for initializing the similar modules of a new task as explained in
Section 4. We then allow the agent to interact with the new environment and update the
network parameters. The experiments were executed on a computer with an AMD Ryzen
Threadripper 2970WX 24-core CPU at 3.00 GHz, on which every epoch needed an average of
70 seconds to run. The experiments are designed to compare the performance of M-DDPG
against DDPG as a baseline with respect to jumpstart and asymptotic performance, as
suggested by Taylor and Stone (2009).
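The training schedule translates into a simple per-epoch budget. The short calculation below uses the figures stated above; counting 40 parameter updates after every rollout is one reading of the schedule and is an assumption, and the helper names are hypothetical.

```python
CYCLES_PER_EPOCH = 50
ROLLOUTS_PER_CYCLE = 2
STEPS_PER_ROLLOUT = 50
UPDATES_PER_ROLLOUT = 40     # assumption: 40 updates follow every rollout
SECONDS_PER_EPOCH = 70       # average reported for the test machine


def per_epoch_budget() -> dict:
    """Rollouts, environment steps, and parameter updates in one epoch."""
    rollouts = CYCLES_PER_EPOCH * ROLLOUTS_PER_CYCLE
    return {
        "rollouts": rollouts,
        "env_steps": rollouts * STEPS_PER_ROLLOUT,
        "updates": rollouts * UPDATES_PER_ROLLOUT,
    }


def wall_clock_hours(epochs: int) -> float:
    """Approximate training time for a given number of epochs."""
    return epochs * SECONDS_PER_EPOCH / 3600.0


budget = per_epoch_budget()
hours_for_100_epochs = wall_clock_hours(100)
```

Under these assumptions one epoch corresponds to 5,000 environment steps, and 100 epochs take a little under two hours on the reported hardware.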
5.2 Experimental Results
The proposed framework, specifically M-DDPG, was implemented and tested on
three Robotics environments of the OpenAI Gym. Task modularization was conducted by
decomposing the initial tasks into sub-tasks associated with before and after reaching the
object (see Figure 4). For example, consider Reach and Push as the source tasks already
learned, and Pick-and-Place as the target task to be learned. Now let us define three source
task modules: (A) Reach, (B) Push before reaching the object, and (C) Push after reaching
the object. Let us also define two target task modules: (D) Pick-and-Place before reaching
the object, (E) Pick-and-Place after picking the object up. A potential mapping for this
scenario would therefore be $D \to B$ or $D \to A$. The experiments have been conducted on
two scenarios:
• Scenario 1 (Figure 5). Transfer the reaching module of the Pick-and-Place task from
(i) the Reach task, or (ii) the reaching module of the Push task. Baseline: DDPG
(no transfer).
• Scenario 2 (Figure 6). Transfer the reaching module of the Push task from (i) the
Reach task, or (ii) the reaching module of the Pick-and-Place task. Baseline: DDPG
(no transfer).
Figure 4: Experimental setup and task modularization on the OpenAI Gym environments: (a) Reach a randomly chosen position in 3D space. (b) Push an object to a randomly chosen position in 2D space (tabletop). (c) Pick-and-Place to a randomly chosen position in 3D space.
Figure 5: Preliminary results on Scenario 1
Figure 6: Preliminary results on Scenario 2
Results also indicate the importance of selecting ‘the right’ source task: in Reach,
the agent is rewarded for reaching the object regardless of the ‘side’ reached, while in Push
and Pick-and-Place, the agent may receive no reward even when it reaches the object if it
is on the wrong side. In the graphs, the X-axis is the number of epochs for which the agent
has been trained and the Y-axis is the success rate during testing, where we execute
100 test rounds at each epoch. To reduce the influence of noise and randomness, the agent in
each environment setup is trained 5 times. The solid lines represent the mean
performance for each environment setting, and the shaded regions represent its variance.
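The curves described here can be computed from raw test outcomes as sketched below. The example assumes that the success rate at an epoch is the fraction of successful test rounds and that the spread is the population variance across runs; the thesis does not state which variance estimator was used, and the numbers are made up for illustration.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful test rounds at one epoch."""
    return sum(outcomes) / len(outcomes)


def aggregate(runs: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """runs[k][e] is the success rate of run k at epoch e.

    Returns the per-epoch mean and (population) variance across runs.
    """
    per_epoch = list(zip(*runs))
    return [mean(v) for v in per_epoch], [pvariance(v) for v in per_epoch]


# Hypothetical data: 60 successes out of 100 test rounds, and two short runs.
rate = success_rate([True] * 60 + [False] * 40)
means, variances = aggregate([[0.0, 0.5], [0.2, 0.7]])
```

The `means` list corresponds to the solid line and `variances` to the shaded band around it.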
The results show that the proposed framework significantly outperforms the
baseline DDPG in boosting both the jumpstart and asymptotic performance of the robot
(Taylor and Stone, 2009). That is, the number of iterations to achieve 60% success rate has
been reduced by over 80% in Scenario 1 and over 70% in Scenario 2. Further, the agent
converges much faster in both scenarios with M-DDPG compared to the baseline DDPG.
Moreover, the agent is able to fully learn the Pick-and-Place task (transfer from both Reach
and Push), while the baseline DDPG is unable to advance beyond 70% success rate on
average. One interesting observation is that, in Scenario 2, transfer from Reach yields
slightly lower performance than transfer from Pick-and-Place (PNP) (see Figure 6). We
speculate the underlying reason to be that in Reach, the agent is rewarded for reaching the
object regardless of what side of the object it has reached, while in Push the agent may
receive no reward even when it reaches the object if it is on the wrong side. This behavior
is less evident in Scenario 1, which may be due to the fact that the Pick-and-Place agent
must reach the center of the object, not a specific side of it. Another observation is that the
success rate of the agent has relatively lower variance under M-DDPG. We speculate that
this behavior is due to the introduction of an inductive bias (Botvinick et al., 2019) through
transferring from similar, previously learned task modules.
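The comparison metrics used above can be computed from two learning curves as in the following sketch. The curves are invented purely to exercise the functions and are not the thesis results; jumpstart and asymptotic gain are taken here as the difference at the first and last evaluated epoch, which is one simple reading of the definitions in Taylor and Stone (2009).

```python
from typing import Optional, Sequence


def epochs_to_threshold(curve: Sequence[float], threshold: float = 0.6) -> Optional[int]:
    """First epoch (1-based) at which the success rate reaches the threshold."""
    for epoch, rate in enumerate(curve, start=1):
        if rate >= threshold:
            return epoch
    return None


def percent_reduction(baseline_epochs: int, transfer_epochs: int) -> float:
    """Relative reduction in epochs needed, in percent."""
    return 100.0 * (baseline_epochs - transfer_epochs) / baseline_epochs


def jumpstart(transfer: Sequence[float], baseline: Sequence[float]) -> float:
    """Difference in initial performance."""
    return transfer[0] - baseline[0]


def asymptotic_gain(transfer: Sequence[float], baseline: Sequence[float]) -> float:
    """Difference in final performance."""
    return transfer[-1] - baseline[-1]


# Hypothetical success-rate curves, one value per epoch.
baseline_curve = [0.0, 0.1, 0.2, 0.4, 0.6, 0.7]
transfer_curve = [0.3, 0.6, 0.8, 0.9, 1.0, 1.0]

reduction = percent_reduction(epochs_to_threshold(baseline_curve),
                              epochs_to_threshold(transfer_curve))
```

For these made-up curves the threshold is reached at epoch 5 without transfer and at epoch 2 with transfer.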
5.3 Discussion
The results shown in the graphs demonstrate that our proposed framework can boost
the training process and reach much better performance than the baseline DDPG algorithm.
However, the experiments have some limitations, which need further research.
First, the tasks and the environments used in this thesis are simple and
straightforward. Both the Push task and the Pick-and-Place task can be decomposed into only
two modules, and the same holds for the composition process. The boundary between these
two modules is very clear: it is the moment when the robot gripper touches the object. Also,
the relationship between these two modules is easy to discover: the constructed graph
contains two nodes and only one directed edge, pointing from the reach module
to the Push/PNP module. However, in real-life problems, the number of modules decomposed
from a task can be much larger, and the relationships among modules can be
diverse, including single module, compositional structure, weighted ensemble, and general
function-composition tree (Alet, Lozano-Pérez and Kaelbling, 2018). Thus, further
evaluation of the proposed framework needs to be done on more difficult tasks, which
contain more modules with complex inner structure, to demonstrate the efficiency of the
modularization and composition steps.
Second, in this thesis, we assume that the result of similarity checking between new
modules and pre-trained modules in the module repository is binary and unique: a
pre-trained module is either the same as the new module or totally different. So when the
target pre-trained module is found, we initialize the new module with its exact parameters,
in a hard-copy style. However, the interesting difference observed between transferring
from Reach and transferring from PNP in Scenario 2 suggests that there may exist several
similar pre-trained modules in the module repository, with different degrees of similarity.
Thus, several new questions are posed related to transfer between task modules: Which
metrics can we use to calculate the degree of similarity? When we have several similar
pre-trained module candidates, can we transfer the policy from multiple candidates
weighted by their degree of similarity (soft-copy style), or should we choose the most
similar one? How would these two methods affect the performance of the agent?
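To make the two options concrete, the sketch below contrasts a hard copy with one naive form of soft copy, a similarity-weighted average of the candidates' parameters. This is only an illustration of the open question: the thesis does not define a similarity metric or a blending rule, and averaging network parameters is not claimed to work in practice. The function names and numbers are hypothetical.

```python
from typing import List, Sequence


def hard_copy(candidates: Sequence[Sequence[float]],
              similarities: Sequence[float]) -> List[float]:
    """Copy the parameters of the single most similar candidate."""
    best = max(range(len(candidates)), key=lambda k: similarities[k])
    return list(candidates[best])


def soft_copy(candidates: Sequence[Sequence[float]],
              similarities: Sequence[float]) -> List[float]:
    """Blend candidate parameters, weighted by normalised similarity."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("similarities must sum to a positive value")
    weights = [s / total for s in similarities]
    n_params = len(candidates[0])
    return [sum(w * c[i] for w, c in zip(weights, candidates))
            for i in range(n_params)]


# Two hypothetical pre-trained modules with assumed similarity scores.
candidates = [[0.0, 2.0], [4.0, 6.0]]
similarities = [1.0, 3.0]

hard = hard_copy(candidates, similarities)
soft = soft_copy(candidates, similarities)
```

Comparing agents initialized from `hard` and from `soft` would be one way to study the last question empirically.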
Further research on these limitations is key to increasing the generality of the
proposed framework, so that it can be applied to complex real-life problems.
6. Conclusion
This thesis proposes a novel task modularization framework built upon recent
advances in transfer and meta-learning to enhance the adaptability of complex and
unstructured robotic applications in manufacturing and logistics. Our research is motivated,
on one hand, by the 11.47% compound annual growth in the global market for industrial
robotics (MHI-Deloitte, 2019), and on the other hand, by the absence of rigorous
methodologies for efficient transfer of tasks in robotics for higher autonomy and
adaptability, which does not exist in current, start-from-scratch RL methods. We
incorporate the rich science of complex adaptive systems with recent advances in deep RL
to address a fundamental research question: How can a robot autonomously adapt to
complex and unstructured manufacturing and logistics environments (e.g., assembly or
piece-picking) through automated transfer of its learned knowledge across parametrically
and non-parametrically different tasks? We tackled this question through a three-stage
framework for task modularization based on the principles of complex systems theory and
module neural network training based on DDPG. Our implementation of the proposed
M-DDPG on the Robotics environments of the OpenAI Gym indicated significant
improvement in both jumpstart and asymptotic performance of the agent compared to the
baseline DDPG.
Future research on this project will address the limitations discussed in Section
5. First, the proposed framework should be evaluated on more difficult tasks that contain
more modules and richer inner structure among them. Second, appropriate metrics should be
chosen to determine the degree of similarity between modules, and different methods of
transferring the knowledge should be tested to identify the one that gives the agent the
best performance. Third, graph neural networks (GNNs) (Battaglia et al., 2018) are an
excellent tool for representing relationships. Thus, in future research, generating a GNN can
be a potential way to compose all modules together.
The long-term vision of the authors is to both increase the range of robotic
operations (RightHand Robotics, 2018) and minimize reprogramming through task
modularization and sharing of the learned knowledge across various tasks and fleets of
robots. By integrating the science of complex adaptive systems with artificial intelligence
techniques and scaling the notion of modularization and transfer up to a large network of
robots performing a wide variety of tasks, the proposed research will result in a
continuously evolving, shared knowledgebase that will eventually enable near-real-time
learning of new robotic tasks in manufacturing, logistics, and other industries. This, in turn,
will lead to less-labor-intensive, 24/7 operations with significantly shorter lead times,
serving the ultimate vision of Industry 4.0 for lot-size of one production. Our vision can be
further expanded to a longer-term research endeavor to: 1) Enhance adaptive learning in
networks of heterogeneous robots; 2) Improve future of work systems with human-in-the-
loop by enabling transfer learning between humans and piece-picking robots; 3) Extend the
developed models and frameworks to other robotic applications beyond manufacturing and
logistics.
REFERENCES
Alet, F., Lozano-Pérez, T. and Kaelbling, L. P. (2018) ‘Modular meta-learning’.
Andreas, J. et al. (2016) ‘Neural module networks’, in Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition. doi:
10.1109/CVPR.2016.12.
Andrychowicz, M. et al. (2017) ‘Hindsight experience replay’, Advances in Neural
Information Processing Systems, pp. 5049–5059.
Baldwin, C. Y. and Clark, K. B. (2000) Design rules. Volume 1, The power of
modularity. MIT Press.
Barto, A. G. and Mahadevan, S. (2003) ‘Recent Advances in Hierarchical Reinforcement
Learning’, Discrete Event Dynamic Systems: Theory and Applications. doi:
10.1023/A:1022140919877.
Battaglia, P. W. et al. (2018) ‘Relational inductive biases, deep learning, and graph
networks’, pp. 1–40.
Botvinick, M. et al. (2019) ‘Reinforcement Learning, Fast and Slow’, Trends in
Cognitive Sciences. Elsevier Ltd, 23(5), pp. 408–422. doi: 10.1016/j.tics.2019.02.006.
Brockman, G. et al. (2016) ‘OpenAI Gym’, arXiv:1606.01540v1, pp. 1–4.
Caruana, R. (1997) ‘Multitask Learning’, Machine Learning. doi:
10.1023/A:1007379606734.
Chitnis, R., Kaelbling, L. P. and Lozano-Perez, T. (2019) ‘Learning quickly to plan
quickly using modular meta-learning’, in Proceedings - IEEE International Conference
on Robotics and Automation. doi: 10.1109/ICRA.2019.8794342.
Collobert, R. and Weston, J. (2008) ‘A unified architecture for natural language
processing’, in. doi: 10.1145/1390156.1390177.
Degris, T., White, M. and Sutton, R. S. (2012) ‘Off-policy actor-critic’, in Proceedings of
the 29th International Conference on Machine Learning, ICML 2012.
Deisenroth, M. P. et al. (2014) ‘Multi-task policy search for robotics’, in Proceedings -
IEEE International Conference on Robotics and Automation. doi:
10.1109/ICRA.2014.6907421.
Deng, L., Hinton, G. and Kingsbury, B. (2013) ‘New types of deep neural network
learning for speech recognition and related applications: An overview’, in ICASSP, IEEE
International Conference on Acoustics, Speech and Signal Processing - Proceedings. doi:
10.1109/ICASSP.2013.6639344.
Devin, C. et al. (2017) ‘Learning modular neural network policies for multi-task and
multi-robot transfer’, Proceedings - IEEE International Conference on Robotics and
Automation, pp. 2169–2176. doi: 10.1109/ICRA.2017.7989250.
Duan, Y. et al. (2016) ‘RL$^2$: Fast Reinforcement Learning via Slow Reinforcement
Learning’, pp. 1–14.
Finn, C., Abbeel, P. and Levine, S. (2017) ‘Model-agnostic meta-learning for fast
adaptation of deep networks’, 34th International Conference on Machine Learning,
ICML 2017, 3, pp. 1856–1868.
Frans, K. et al. (2018) ‘Meta learning shared hierarchies’, in 6th International
Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings.
García, J. and Shafie, D. (2020) ‘Teaching a humanoid robot to walk faster through Safe
Reinforcement Learning’, Engineering Applications of Artificial Intelligence. Elsevier
Ltd, 88(November 2019), p. 103360. doi: 10.1016/j.engappai.2019.103360.
Gianetto, D. A. and Heydari, B. (2015) ‘Network Modularity is essential for evolution of
cooperation under uncertainty’, Scientific Reports. Nature Publishing Group, 5(1), p.
9340. doi: 10.1038/srep09340.
Girshick, R. (2015) ‘Fast R-CNN’, in Proceedings of the IEEE International Conference
on Computer Vision. doi: 10.1109/ICCV.2015.169.
Gupta, A. et al. (2018) ‘Meta-reinforcement learning of structured exploration strategies’,
Advances in Neural Information Processing Systems, pp. 5302–5311.
Harlow, H. F. (1949) ‘The formation of learning sets’, Psychological Review. doi:
10.1037/h0062474.
Van Hasselt, H., Guez, A. and Silver, D. (2016) ‘Deep reinforcement learning with
double Q-Learning’, in 30th AAAI Conference on Artificial Intelligence, AAAI 2016.
Hessel, M. et al. (2019) ‘Multi-Task Deep Reinforcement Learning with PopArt’,
Proceedings of the AAAI Conference on Artificial Intelligence. doi:
10.1609/aaai.v33i01.33013796.
Heydari, B. and Dalili, K. (2015) ‘Emergence of modularity in system of systems:
Complex networks in heterogeneous environments’, IEEE Systems Journal, 9(1), pp.
223–231. doi: 10.1109/JSYST.2013.2281694.
Heydari, B., Mosleh, M. and Dalili, K. (2016) ‘From Modular to Distributed Open
Architectures: A Unified Decision Framework’. doi: 10.1002/sys.21348.
Hofmann, E. and Rüsch, M. (2017) ‘Industry 4.0 and the current status as well as future
prospects on logistics’, Computers in Industry. Elsevier B.V., 89, pp. 23–34. doi:
10.1016/j.compind.2017.04.002.
Hou, Y. et al. (2017) ‘A novel DDPG method with prioritized experience replay’, in 2017
IEEE International Conference on Systems, Man, and Cybernetics, SMC 2017. doi:
10.1109/SMC.2017.8122622.
Konda, V. R. and Tsitsiklis, J. N. (2000) ‘Actor-critic algorithms’, in Advances in Neural
Information Processing Systems.
Kusiak, A. (2018) ‘Smart manufacturing’, International Journal of Production Research.
Taylor & Francis, 56(1–2), pp. 508–517. doi: 10.1080/00207543.2017.1351644.
Lasi, H. et al. (2014) ‘Industry 4.0’, Business and Information Systems Engineering. doi:
10.1007/s12599-014-0334-4.
Levine, S. et al. (2016) ‘End-to-end training of deep visuomotor policies’, Journal of
Machine Learning Research.
Lillicrap, T. P. et al. (2016) ‘Continuous control with deep reinforcement learning’, in 4th
International Conference on Learning Representations, ICLR 2016 - Conference Track
Proceedings.
Lillicrap, T. P. et al. (2019) ‘Continuous control with deep reinforcement learning’,
arXiv:1509.02971.
Lu, Y. (2017) ‘Industry 4.0: A survey on technologies, applications and open research
issues’, Journal of Industrial Information Integration. Elsevier Inc., 6, pp. 1–10. doi:
10.1016/j.jii.2017.04.005.
Mishra, N. et al. (2018) ‘A simple neural attentive meta-learner’, 6th International
Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings,
pp. 1–17.
Mnih, V. et al. (2015) ‘Human-level control through deep reinforcement learning’,
Nature. Nature Publishing Group, 518(7540), pp. 529–533. doi: 10.1038/nature14236.
Mnih, V. et al. (2016) ‘Asynchronous methods for deep reinforcement learning’, in 33rd
International Conference on Machine Learning, ICML 2016.
Monostori, L. et al. (2016) ‘Cyber-physical systems in manufacturing’, CIRP Annals -
Manufacturing Technology, 65(2), pp. 621–641. doi: 10.1016/j.cirp.2016.06.005.
Moore, W. L., Louviere, J. J. and Verma, R. (1999) ‘Using Conjoint Analysis to Help
Design Product Platforms’, Journal of Product Innovation Management. doi:
10.1111/1540-5885.1610027.
Mosleh, M. and Heydari, B. (2017) ‘Fair topologies: Community structures and network
hubs drive emergence of fairness norms’, Scientific Reports. doi: 10.1038/s41598-017-
01876-0.
Alet, F. et al. (2019) ‘Neural relational inference with fast modular meta-learning’, in
Advances in Neural Information Processing Systems.
Nolfi, S. (1997) ‘Using Emergent Modularity to Develop Control Systems for Mobile
Robots’, Adaptive Behavior. Sage Publications, Thousand Oaks, CA, 5(3–4), pp.
343–363. doi: 10.1177/105971239700500306.
OpenAI et al. (2019) ‘Dota 2 with Large Scale Deep Reinforcement Learning’.
Peter, H. W. and Chomsky, N. (1968) ‘Aspects of the Theory of Syntax’, The Modern
Language Review. doi: 10.2307/3722650.
Pritzel, A. et al. (2017) ‘Neural episodic control’, 34th International Conference on
Machine Learning, ICML 2017, 6, pp. 4320–4331.
Ramsundar, B. et al. (2015) ‘Massively Multitask Networks for Drug Discovery’, (Icml).
Ritter, S. et al. (2018) ‘Been there, done that: Meta-learning with episodic recall’, 35th
International Conference on Machine Learning, ICML 2018, 10(1), pp. 6929–6938.
Russell, S. and Zimdars, A. L. (2003) ‘Q-Decomposition for Reinforcement Learning
Agents’, in Proceedings, Twentieth International Conference on Machine Learning.
Schaul, T. et al. (2016) ‘Prioritized experience replay’, in 4th International Conference
on Learning Representations, ICLR 2016 - Conference Track Proceedings.
Schulman, J. et al. (2017) ‘Proximal Policy Optimization Algorithms’, pp. 1–12.
Silver, D. et al. (2014) ‘Deterministic policy gradient algorithms’, in 31st International
Conference on Machine Learning, ICML 2014.
Silver, D. et al. (2016) ‘Mastering the game of Go with deep neural networks and tree
search’, Nature. doi: 10.1038/nature16961.
Silver, D. et al. (2017) ‘Mastering the game of Go without human knowledge’, Nature.
doi: 10.1038/nature24270.
Simon, H. A. (1991) ‘The Architecture of Complexity’, in Facets of Systems Science.
Boston, MA: Springer US, pp. 457–476. doi: 10.1007/978-1-4899-0718-9_31.
Simpkins, C. and Isbell, C. (2019) ‘Composable Modular Reinforcement Learning’,
Proceedings of the AAAI Conference on Artificial Intelligence. doi:
10.1609/aaai.v33i01.33014975.
Sprague, N. and Ballard, D. (2003) ‘Multiple-goal reinforcement learning with modular
Sarsa(0)’, in IJCAI International Joint Conference on Artificial Intelligence.
Sullivan, K. J. et al. (2001) ‘The structure and value of modularity in software design’, in
Proceedings of the 8th European software engineering conference held jointly with 9th
ACM SIGSOFT international symposium on Foundations of software engineering -
ESEC/FSE-9. New York, New York, USA: ACM Press, p. 99. doi:
10.1145/503209.503224.
Sutton, R. S. et al. (2000) ‘Policy gradient methods for reinforcement learning with
function approximation’, in Advances in Neural Information Processing Systems.
Tamar, A. et al. (2016) Learning from the Hindsight Plan -- Episodic MPC Improvement.
Taylor, M. E. and Stone, P. (2009) ‘Transfer Learning for Reinforcement Learning
Domains: A Survey’, Journal of Machine Learning Research, 10, pp. 1633–1685.
Teh, Y. W. et al. (2017) ‘Distral: Robust multitask reinforcement learning’, in Advances
in Neural Information Processing Systems.
Todorov, E., Erez, T. and Tassa, Y. (2012) ‘MuJoCo: A physics engine for model-based
control’, in IEEE International Conference on Intelligent Robots and Systems. doi:
10.1109/IROS.2012.6386109.
Uchibe, E., Asada, M. and Hosoda, K. (1996) ‘Behavior coordination for a mobile robot
using modular reinforcement learning’, in IEEE International Conference on Intelligent
Robots and Systems. doi: 10.1109/iros.1996.568989.
Uhlenbeck, G. E. and Ornstein, L. S. (1930) ‘On the theory of the Brownian motion’,
Physical Review. doi: 10.1103/PhysRev.36.823.
Vecerik, M. et al. (2019) ‘A practical approach to insertion with variable socket position
using deep reinforcement learning’, in Proceedings - IEEE International Conference on
Robotics and Automation. doi: 10.1109/ICRA.2019.8794074.
Vinyals, O. et al. (2019) ‘Grandmaster level in StarCraft II using multi-agent
reinforcement learning’, Nature. doi: 10.1038/s41586-019-1724-z.
Wang, J. X. et al. (2016) ‘Learning to reinforcement learn’, pp. 1–17.
Wang, W. et al. (2019) ‘Facilitating Human-Robot Collaborative Tasks by Teaching-
Learning-Collaboration from Human Demonstrations’, IEEE Transactions on
Automation Science and Engineering. doi: 10.1109/TASE.2018.2840345.
Wang, Z. et al. (2016) ‘Dueling Network Architectures for Deep Reinforcement
Learning’, in 33rd International Conference on Machine Learning, ICML 2016.
Weiss, K., Khoshgoftaar, T. M. and Wang, D. D. (2016) ‘A survey of transfer learning’,
Journal of Big Data. doi: 10.1186/s40537-016-0043-6.
Wixson, L. E. (1991) ‘Scaling Reinforcement Learning Techniques via Modularity’, in
Machine Learning Proceedings 1991. doi: 10.1016/b978-1-55860-200-7.50076-3.
Yu, T. et al. (2019) ‘Meta-World: A Benchmark and Evaluation for Multi-Task and Meta
Reinforcement Learning’, (CoRL).
Zhang, Y. and Yang, Q. (2017) ‘A Survey on Multi-Task Learning’, pp. 1–20.
Zhong, R. Y. et al. (2017) ‘Intelligent Manufacturing in the Context of Industry 4.0: A
Review’, Engineering, 3(5), pp. 616–630. doi: 10.1016/J.ENG.2017.05.015.
RightHand Robotics (2018) ‘Understanding the 3Rs of Robotic Piece-Picking’. URL:
https://www.mmh.com/article/understanding_the_3rs_of_robotic_piece_picking.