c Consult author(s) regarding copyright matters · Carlos Sampedro 1, Alejandro Rodriguez-Ramos ,...
Transcript of c Consult author(s) regarding copyright matters · Carlos Sampedro 1, Alejandro Rodriguez-Ramos ,...
This may be the author’s version of a work that was submitted/acceptedfor publication in the following source:
Sampedro, Carlos, Rodriguez-Ramos, Alejandro, Gil, Ignacio, Mejias Al-varez, Luis, & Campoy, Pascual(2018)Image-based visual servoing controller for multirotor aerial robots usingdeep reinforcement learning.In Laschi, C (Ed.) Proceedings of the 2018 IEEE/RSJ International Con-ference on Intelligent Robots and Systems.IEEE, United States of America, pp. 979-986.
This file was downloaded from: https://eprints.qut.edu.au/120147/
c© Consult author(s) regarding copyright matters
This work is covered by copyright. Unless the document is being made available under aCreative Commons Licence, you must assume that re-use is limited to personal use andthat permission from the copyright owner must be obtained for all other uses. If the docu-ment is available under a Creative Commons License (or other specified license) then referto the Licence for details of permitted re-use. It is a condition of access that users recog-nise and abide by the legal requirements associated with these rights. If you believe thatthis work infringes copyright please provide details by email to [email protected]
Notice: Please note that this document may not be the Version of Record(i.e. published version) of the work. Author manuscript versions (as Sub-mitted for peer review or as Accepted for publication after peer review) canbe identified by an absence of publisher branding and/or typeset appear-ance. If there is any doubt, please refer to the published source.
https://doi.org/10.1109/IROS.2018.8594249
Image-Based Visual Servoing Controller for Multirotor Aerial Robots
Using Deep Reinforcement Learning
Carlos Sampedro1, Alejandro Rodriguez-Ramos1, Ignacio Gil1, Luis Mejias2 and Pascual Campoy1
Abstract— In this paper, we propose a novel Image-BasedVisual Servoing (IBVS) controller for multirotor aerial robotsbased on a recent deep reinforcement learning algorithm namedDeep Deterministic Policy Gradients (DDPG). The proposedRL-IBVS controller is successfully trained in a Gazebo-basedsimulated environment in order to learn the appropriate IBVSpolicy for directly mapping a state, based on errors in the image,to the linear velocity commands of the aerial robot. A thoroughvalidation of the proposed controller has been conducted insimulated and real flight scenarios, demonstrating outstandingcapabilities in object following applications. Moreover, weconduct a detailed comparison of the RL-IBVS controller withrespect to classic and partitioned IBVS approaches.
I. INTRODUCTION
Visual servoing is a mature technique, which has been
extensively studied in the literature over the years. Despite
many efforts, the classic approach to map visual errors to
actuator commands still has limitations in terms of conver-
gence, stability and gain design. In the field of aerial robotics,
visual servoing has been used in a wide range of applications
such as object following [1], [2], [3], object grasping and
manipulation [4], [5], landing [6], and inspection tasks [7],
[8], [9], among others. In many cases, technique success has
been achieved by linearizing states around an operating point
(i.e hover or near hover).
In its two variations, traditional visual servoing strategies,
i.e, Position-Based Visual Servoing (PBVS) and Image-
Based Visual Servoing (IBVS) have achieved acceptable
performance, however notable convergence and stability
problems [10], [11] can appear if the task is performed in a
zone away from the operating point (or linearization point).
Some of these problems are related to the calculation of the
image Jacobian, also called interaction matrix, which defines
a crucial role in the implementation of the control law, but
also to the choice of an appropriate gain that provides a good
tradeoff between convergence and stability. Other important
problems are those related to local minima solutions owing
to unrealizable image motions. Moreover, the tuning process
of the different parameters of the control model can become
an onerous and hard procedure in most of the applications.
Inspired by the aforementioned limitations, in this paper
we propose a novel IBVS controller based on a recent deep
reinforcement learning algorithm named Deep Deterministic
Policy Gradients (DDPG) [12]. Taking as input the errors in
1Computer Vision and Aerial Robotics group, Centre for Automationand Robotics, Universidad Politecnica de Madrid (UPM-CSIC), Calle JoseGutierrez Abascal 2, Madrid (Spain). [email protected]
2 School of Electrical Engineering and Computer Science, QueenslandUniversity of Technology, Brisbane QLD 4000, Australia.
the image plane with respect to a predefined reference, the
agent is successfully trained in order to generate the appropri-
ate linear velocity commands in the x, y and z directions for
controlling the aerial robot. The training process of the agent
has been conducted using our own reinforcement learning
framework, which integrates recent deep reinforcement learn-
ing algorithms with an aerial robotics simulator (i.e. RotorS
Gazebo [13]). In order to infer velocity commands, the
agent’s state is composed of the errors in the image plane and
their respective derivatives. Furthermore, the reward function
is carefully designed in order to obtain a smooth behavior of
the agent while performing object following missions. These
missions have been conducted in simulated and real flight
experiments, demonstrating the appropriate capabilities of
the proposed RL-IBVS algorithm.
The main contributions of the presented work are summa-
rized here:
• A novel IBVS controller for multirotor aerial robots
based on deep reinforcement learning is proposed. The
RL-IBVS controller is completely trained in simulation
and tested in simulated and real scenarios.
• An extensive validation and comparison of the proposed
RL-IBVS approach with respect to classic IBVS con-
trollers are conducted in simulated and real flights.
The remainder of this paper is structured as follows:
Section II provides an overview of the related work, Section
III introduces the formulation of the proposed RL-IBVS
controller. The experiments conducted in simulated and real
scenarios are presented in Section IV. Results are discussed
in Section V, and finally Section VI concludes the paper as
well as points out future research directions.
II. RELATED WORK
Recent literature on visual servoing and machine learning
have focused on learning features and predictive models
under the impression that traditional visual servoing have
directed most efforts at designing stable and convergent
controllers (assuming such features are available). We see
the opposite statement, nowadays, there is a suite of ro-
bust descriptors and detectors that can work in challenging
lighting conditions, therefore a problem still remains on
designing controllers that are robust and stable in a large task
space (not just about the linearization point). The problem
becomes amplified when we consider aerial platforms such
as quadrotors. One of the pioneering works on the use of
reinforcement learning for aerial robots is [14], where a
reinforcement learning (optimal control) algorithm was used
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.
to find a controller that is optimized for specific aerial ma-
neuvers. Recent efforts on deep learning and visual servoing
include Lee et al. [15] in which a Q-iteration algorithm is
used to learn which features are best suited for the task
at hand, however ultimately their optimization function is
solved efficiently by the linearization of the dynamics around
an operating point. Convolutional neural networks (CNNs)
have been popular in many areas of robotics, science and
engineering. For instance, Bateux et al. [16] have addressed
the visual servoing problem by training a CNN to estimate
the relative pose between the current and reference image,
then having this relative translation and rotation a classic
visual servoing control law is used [17]. Saxena et al. [18]
similarly approach the problem using CNNs as well, however
additional experimentation is conducted on a quadrotor using
a standard servoing control approach.
In terms of control and navigation, reinforcement learning
has also been used to learn navigation routines for a quadro-
tor [19], control and stabilization [20], target tracking [21]
and landing [22], not explicitly addressing the visual servoing
task with machine learning. Closest work to the one pre-
sented here is Shi et al. [23] in which two gains (translation
and rotation) are learned to position a quadrotor relative to a
target using fuzzy Q-learning. From previous literature, little
attention has been placed on learning gains for robustness
and stability. This is a standing issue in many decoupled
IBVS approaches for aerial robotics.
III. IMAGE-BASED VISUAL SERVOING
CONTROLLERS
A. Classic and Partitioned IBVS Controllers
The use of IBVS approaches with aerial robots is a chal-
lenging task. Classic IBVS as proposed in the literature [17]
establishes that: ν = −λL+s e will drive linear and angular
velocity references to zero as the features s move towards
their desired locations s∗, being the error: e = s− s∗. This
approach requires a minimum of four points, but due to the
estimation of varying 3-D parameters during the visual servo,
the term L+s can be difficult to estimate or approximate. An
alternative approach detailed in [24], proposes that L+s =
1
2(Ls + L∗
s)+ where L∗
s is the interaction matrix at the
desired feature location. Better convergence capabilities than
classic IBVS has been proven in this work. In this paper, we
use the approach described in [24] as one of the baselines
techniques (referred to as classic IBVS) to compare the
performance of our proposed RL-IBVS. Furthermore, we
enhance the classic IBVS by introducing an adaptive gain
λ(x) = a exp(−bx)+c, where a = λ(0)−λ(∞), b = λ(0)/aand c = λ(∞) are constant parameters and x is the infinity
norm of the task Jacobian to consider.
For aerial vehicles, a common way to deal with underac-
tuated systems such as quadrotors is to decouple the visual
servoing task [25]. A decoupled IBVS approach aims at
separating translational and rotational degrees of freedom
in the control task. A detailed overview of this technique
is presented in [25], [26]. Inspired by the decoupled ap-
proaches, the partitioned IBVS emerges, which exhibits sim-
ilar decoupling properties but using only features expressed
in the image plane. The formulation of the partitioned IBVS
is based on the decoupling of the motion related to the zaxis from the x and y axes motion. For this purpose, the
interaction matrix is partitioned such that:
s = Lsν (1)
= Lxyvxy + Lzvz (2)
= sxy + sz (3)
where Lxy = columns{1, 2, 4, 5} of Ls and Lz =columns{3, 6} of Ls, and νxy = (vx, vy, ωx, ωy) and
νz = (vz, ωz). ν = (v, ω) = (vx, vy, vz, ωx, ωy, ωz) ∈ ℜ6
is the vector that contains linear and angular velocities of
the camera and s are the feature velocities in the image,
which coordinates are defined by s = (u, v) with their
desired values being s∗ = (u∗, v∗). Then, the control law
(ν = −λL+e), with ei = si − s∗i and λ a positive gain,
becomes:
νxy = −L+xy(λe+ Lzνz) (4)
Assuming an operating point (linearization point -near
hover-), (4) can be simplified, indeed, obtaining linear and
angular velocity commands that can be sent separately and
simultaneously to the quadrotor in order to achieve transla-
tion and rotation. For instance, rotation can be controlled by
ωz = λωz(ψ∗
t − ψt) where ψt and ψ∗
t are the orientation of
the target in the image and its desired value, respectively.
Altitude can be controlled by vz = λvz lnσ∗
σwhere σ and
σ∗ are the area and its desired value of the polygon enclosed
by the set of features s. In this work, we use the partitioned
IBVS approach as another baseline technique in order to
evaluate our RL-IBVS.
B. Reinforcement Learning IBVS Controller
1) Background: The principal components in a reinforce-
ment learning problem are the agent and the environment.
The agent interacts at each time step with the environment
by executing an action and receiving an observation and a
reward from the latter. The final goal of the agent is to
learn the optimal policy in order to maximize the cumulative
future reward: Rt =∑T
i=t γi−tr(si, ai) given the reward
function r(si, ai), state si, the action ai, and a discount
factor γ ∈ [0, 1]. In reinforcement learning, the policy is
traditionally evaluated using the Bellman equation (5), which
computes the expected return once the agent executes an
action at at state st and then follows policy π(a|s) thereafter.
Qπ(st, at) = Eπ[Rt|st, at] (5)
where Qπ(st, at) is the action-value function.
Recently, deep reinforcement learning methods have uti-
lized neural networks to learn the policy as well as to
approximate the action-value function [12], [27], [28]. In [28]
two key techniques were introduced in order to efficiently use
neural networks in the reinforcement learning paradigm: the
experience replay, which helps for breaking the correlation
between observations, and the use of target networks in order
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.
RotorS
Gazebo
Environment
Interface
(ROS)
Environment
Agent
Interface
(ROS)
Velocity
Controller
RL-IBVS
Object
Detector
State
(ex, ey, e�, dex, dey, de�)
State
ROI
Reward
Image
Image
Action
Bottom
Camera
ImageAction
(ax, ay, az)
Objects
State
DDPG
Critic
Actor
Reward
Agent
Action
(ax, ay, az)
Action
(ax, ay, az)
Fig. 1: Architecture of the proposed RL-IBVS system. The enviroment receives as input a ROI in the image plane computed
by an object detector. The agent which implements the DDPG algorithm, receives an observation and a reward from the
environment and computes the action (linear velocity) to be commanded to the velocity controller. The dotted lines represent
interactions between the components in training mode, while the continuous lines depict the interactions in test mode.
to reduce the correlation between the action-value function
(Q) and the target values r + γmaxa′ Q(s′, a′). The last
technique was improved in the DDPG algorithm, where a soft
updating method was introduced to update the parameters of
the target critic and target actor networks.
Regarding the DDPG algorithm, it consists of a model-
free, off-policy algorithm based on an actor-critic architec-
ture. The critic neural network acts as a non-linear function
approximator for estimating the action-value function, while
the neural network represented by the actor is used to
compute the policy that directly maps a state to an action
for the agent.
2) RL-IBVS Architecture: The architecture of the pro-
posed RL-IBVS algorithm has been designed in order to
integrate deep reinforcement learning agents with a Gazebo-
based simulator (e.g. RotorS Gazebo). The Robot Operating
System (ROS) [29] middleware is used to communicate the
agent and the environment (see Fig. 1).
As shown in Fig. 1, the environment receives a Region
Of Interest (ROI) from an object detector algorithm, and
computes the observation at current timestep, which consists
of a 6-dimensional vector defined by (6).
S = {ex, ey, eΦ, dex, dey, deΦ} (6)
where ex, ey represent the normalized error of the center
of the detected ROI with respect to the reference defined
in the image plane, and eΦ is the error in diameter of
the circumscribed circle to the ROI (Φo) with respect to
the reference diameter (Φref ) (see Fig. 2). The variables
dex, dey, deΦ are the normalized derivatives of the error
filtered using a low-pass filter. It has to be remarked here that
these errors in the image plane are computed with respect
to a virtual stabilized image plane. In order to compute
this virtual plane, we apply the transformation in (7) to the
actual ROI taking into account the roll and pitch information
provided by the Inertial Measurement Unit (IMU) of the
aerial robot. This stabilization procedure has been found to
be crucial in the learning process of the agent.
p′ = KRK−1p (7)
where p represents a point in the actual image plane,
p′ is the point in the virtual stabilized image plane, K is
the projection matrix and R is the rotation matrix which
encapsulates the roll and pitch information in the aerial robot
frame of reference (U) expressed in the camera frame (C).
The reference in the image is computed by projecting
the desired 3D points of the target object in the image
plane using the intrinsic camera parameters and considering a
pinhole camera model. In this work, the desired 3D position
of the robot with respect to the object is always considered
centered with respect to the bottom camera frame (C) and at
the desired height above the object.
The agent, which implements the DDPG algorithm, gen-
erates a 3-dimensional continuous action (8) computed by
the actor network and receives an 6-dimensional observation
(6) and a reward (10) from the environment.
A = {ax, ay, az} (8)
where ax, ay, az are the linear velocities commanded to
the velocity controller [30] referred in the world frame (W).
The actor and critic neural networks that compose the
DDPG agent consist of feed-forward neural networks with
two hidden layers of 400 and 300 units. The activation
function utilized in each hidden layer is the Rectified Linear
Unit (ReLU). The output layer of the actor network is
composed of 3 units with a tanh activation function bounded
to [−0.5, 0.5] in order to generate the actions commanded to
the velocity controller. The output layer of the critic network
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.
consists of one unit with a linear activation function in order
to estimate the Q-value function.
3) Reward Function: The reward function implemented
in the environment’s side of the RL-IBVS architecture is
designed in order to evaluate the performance of the agent
in two consecutive time steps according to (10). For this
purpose a shaping function [31] is introduced. As can be
seen in (9), the shaping term penalizes big values of the state
components, that is, we penalize the agent when the errors
computed in the image plane are big. Furthermore, we add
an extra term to the shaping function which penalizes big
values of the actions executed by the agent with the purpose
of obtaining a smoother behavior.
shapingt = −α√
e2x + e2y − β√
de2x + de2y
−γ|eΦ| − δ|deΦ| − ǫ√
a2x + a2y + a2z
(9)
rt = shapingt − shapingt−1 (10)
where rt is the reward obtained at time step t, and
α, β, γ, δ are positive gains that control the penalization in
the error in position and diameter in the image plane together
with their derivatives. ǫ represents the penalization in the
action term. These gains have been empirically obtained,
being α = 100, β = 10, γ = 100, δ = 10, and ǫ = 1.
yW
xW
zW
xCyC zC
v
u
eX
�ref
ey
Image Plane
‹W›
‹C›
yU xU
zU
‹U›
fc
�o
Fig. 2: Notation and reference frames utilized in this work.
Where W , C and U stand for world, camera and aerial robot
frames respectively. The variables used for computing the
state are also depicted in the image plane.
IV. EXPERIMENTS AND RESULTS
A video demonstration of the reported experiments and
results is provided with this manuscript in: https://
vimeo.com/258063350.
A. Experimental Setup
In order to perform simulated and real flight experiments,
the proposed RL-IBVS architecture has been integrated
within the Aerostack framework [32], which is built on top of
the ROS middleware, and provides the necessary additional
software components in order to conduct autonomous flights
(e.g. state estimator, velocity controller, etc).
Two main simulation scenarios have been created using
RotorS Gazebo simulator. The first scenario presented in Fig.
4a is used for training the RL-IBVS agent and is composed of
an AscTec Hummingbird quadrotor representing the robotic
agent, and a cylindrical object utilized as the target for
visual servoing purposes. The Hummingbird quadrotor is
equipped with a fisheye-lens bottom-looking camera with a
horizontal field of view of 132◦ and an image size of 640× 480 pixels. For training the RL-IBVS agent, TensorFlow1
library on a GPU Nvidia GeForce GTX 970 is utilized under
Ubuntu 16.04 operating system. A second Gazebo simulation
scenario (see Fig. 4b) has been created in order to test the
RL-IBVS agent and compare it with the IBVS approaches
presented in Section III-A.
Several real flight experiments have also been conducted
in order to test the capabilities of the proposed IBVS
approaches on a real aerial robotic platform. For this purpose,
we use a Parrot Bebop 2 as the aerial robotic platform
equipped with a Parrot S.L.A.M. dunk (see Fig. 7a), utilized
in this work only for obtaining velocities estimation, and
a camera with an image size of 856 × 480 pixels. An
OptiTrack motion capture system is used in the first real
flight experiment only for gathering ground truth data relative
to the state of the robot and the target. In order to obtain
a reliable comparison of the different IBVS approaches, in
all the experiments we limit the commanded actions to the
range [−0.5, 0.5] m/s and introduce them to the same velocity
controller [30].
B. Training Methodology
Taking into account the aforementioned experimental
setup and using the scenario presented in Fig. 4a, the RL-
IBVS agent is trained using an episodic RL setting. In each
episode, the agent is placed in a random position within an
area of 2 m × 2 m × 1.7 m while the target object is static
always at the origin. In each time step, the agent which runs
at 20 Hz executes an action with added noise according to
an Ornstein-Uhlenbeck distribution with the final objective of
being completely centered on the target object at the desired
altitude with respect to it. The episode is considered finished
when a terminal state is reached. This situation can occur
due to three main events: the target object is out of the
image boundaries, the aerial robot is out of the altitude limits
(< 0.7 m or > 1.7 m), or the maximum number of steps per
episode is reached. When any of the first two events occurs,
the agent is rewarded negatively with -100. In the rest of
the time steps, the agent is rewarded according to (10). The
maximum number of steps per episode utilized is 4000.
1https://www.tensorflow.org/
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.
600 800 1000 1200 1400
Episode
-150
-100
-50
0
50
Accu
m R
ew
ard
(a)
600 800 1000 1200 1400
Episode
-5
0
5
10
Accu
m Q
-Va
lue
104
(b)
Fig. 3: Training curves. (a) Accumulated reward per episode.
(b) Accumulated action-value function (Q) per episode. Each
episode is composed of 4000 simulation steps.
(a)
(b)
Fig. 4: Gazebo simulation environments. (a) Environment
created for training the RL-IBVS agent. (b) Environment
designed for evaluating the IBVS approaches using a leader-
follower configuration.
During the training process of the actor’s and critic’s neu-
ral networks, Adam optimizer has been utilized with a base
learning rate of 10−4 for the actor and 10−3 for the critic,
and a minibatch size of 64. Using this configuration, the
training process of the RL-IBVS agent took approximately
17 hours, learning a stable behavior around episode 1400,
which can be reviewed in the training curves presented in
Fig. 3. In this figure, we only show the behavior of the
agent from episode 600 for clarity. From episode 0 to 800 the
agent is continuously giving very high positive and negative
commands in altitude reaching a terminal state in each time
step. This fact produces the Q-value curve to have values
around zero until the episode 800.
C. Simulation Experiments
The aim of this section is to validate the proposed RL-
IBVS controller and compare it with the IBVS approaches
presented in Section III-A. For this purpose, a leader-follower
experiment using two aerial robots has been designed in
RotorS Gazebo simulator (see Fig. 4b). In the proposed
experiment, an AscTec Firefly quadrotor is used as the
leader and an AscTec Hummingbird acts as the follower.
The leader performs a rhomboidal trajectory (top view) with
an ascending or descending movement depending on the side
of the rhomboid (see Figures 5c, 5f and 5i) at v = 0.33 m/s,
and hovers during several simulation steps in its corners in
order to evaluate the stabilization of each IBVS controller.
The results obtained during the execution of the simulation
experiment are shown in Fig. 5. The results presented in Fig.
5 are summarized on Table I which shows the mean and the
standard deviation for each error variable.
TABLE I: Mean and standard deviation errors in pixels
obtained for the simulation flight experiment of Fig. 5.
IBVS approach ex ey eΦ
Classic 1.6± 24.2 −2.5± 27.0 1.4± 9.1
Partitioned −0.2± 21.2 −1.2± 27.1 0.6± 6.2
RL 5.4± 40.2 2.0± 25.1 −0.03± 15.7
D. Real Flight Experiments
Two main real flight experiments have been designed in
order to evaluate the capabilities of the IBVS approaches
presented in this work onboard a real aerial robotic plat-
form. The first real flight experiment has been designed
as a structured 4 m × 2 m indoor scenario in which we
recreate the leader-follower configuration of the simulation
experiment. In this experiment, the quadrotor leader has been
substituted by a human operator carrying a target marker
in order to obtain a smoother trajectory of the target for
improving the comparative analysis. The human operator
performs a continuous rectangular trajectory (top view) with
an ascending and descending movement in each long side of
the rectangle (see Figures 6c, 6f and 6i). We have considered
this configuration as the most suitable due to the high
turbulences experimented when flying two small quadrotors
one on top of the other. The results obtained during the
execution of the experiment are shown in Fig. 6. A summary
of the mean errors with their respective standard deviation
for the experiment of Fig. 6 are shown on Table II.
TABLE II: Mean and standard deviation errors in pixels
obtained for the real flight experiment of Fig. 6.
IBVS approach ex ey eΦ
Classic 19.9± 94.7 15.8± 72.9 −0.3± 7.3
Partitioned −2.2± 66.6 12.8± 58.1 −1.8± 5.7
RL −10.9± 52.5 2.5± 23.0 −5.2± 13.2
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.
0 20 40 60 80 100
Time (s)
-150
-100
-50
0
50
100
150
Err
or
xy (
pix
els
)
ex
ey
(a)
0 20 40 60 80 100
Time (s)
-30
-20
-10
0
10
20
30
40
Err
or
(pix
els
)
(b)
06
4 4
2
Z (
m)
2 2
Y (m) X (m)
0
4
0-2 -2
-4 -4
Leader
Follower
0
1
2
3
(c)
0 20 40 60 80 100
Time (s)
-150
-100
-50
0
50
100
150
Err
or
xy (
pix
els
)
ex
ey
(d)
0 20 40 60 80 100
Time (s)
-30
-20
-10
0
10
20
30
40
Err
or
(pix
els
)
(e)
06
4 4
2
Z (
m)
2 2
Y (m) X (m)
0
4
0-2 -2
-4 -4
Leader
Follower
0
1
2
3
(f)
0 20 40 60 80 100
Time (s)
-150
-100
-50
0
50
100
150
Err
or
xy (
pix
els
)
ex
ey
(g)
0 20 40 60 80 100
Time (s)
-30
-20
-10
0
10
20
30
40
Err
or
(pix
els
)
(h)
06
4 4
2
Z (
m)
2 2
Y (m) X (m)
0
4
0-2 -2
-4 -4
Leader
Follower
0
1
2
3
(i)
Fig. 5: Simulation flight experiment for a leader-follower configuration. The leader executes a rhomboidal trajectory (top
view) hovering during some simulation steps at each corner. (a), (b), (c) Results obtained with the classic IBVS controller.
(d), (e), (f) Results acquired using the partitioned IBVS. (g), (h), (i) Results obtained for the proposed RL-IBVS approach.
The second real flight experiment has been conceived as
a visual servoing experiment using a moving car as the
object to follow which performs an oval trajectory in an
indoor scenario of 8 m × 8 m (see Fig. 7). The aim of
this experiment is to evaluate the capabilities of the RL-
IBVS approach while performing visual servoing on top of
a moving car that can exhibit changes in its trajectory. The
moving car is remotely controlled by a human operator who
performs changes in the forward direction of the car when
it moves along the y axis (see Fig. 7b). Furthermore, several
pillars are placed on the scenario in order to difficult the
movement of the aerial robot.
V. DISCUSSION
The experiments and results presented in simulated and
real indoor scenarios prove the good behavior exhibited by
the three IBVS controllers presented in this work. Remark-
able results are obtained in the case of the partitioned IBVS
and the proposed RL-IBVS having mean errors less than 6pixels in the simulation flight experiment of Fig. 5 and 13pixels in the real flight experiment presented in Fig. 6. More
notorious is the transition of the RL-IBVS from simulation
to real flight tests. Unlike classic and partitioned IBVS
approaches, in which a considerable effort was made on
tunning the parameters, the RL-IBVS exhibited a really quick
and almost direct transition to real flight experiments, which
is evidenced by the results presented on Tables I and II. In
both tables, the errors obtained for the RL-IBVS are within
the same range demonstrating the good transition capabilities
of the proposed approach. Moreover, these results are of
great interest taking into account that the RL-IBVS agent
was trained on simulation using an AscTec Hummingbird
quadrotor and tested on real flights using a Parrot Bebop 2.
This fact enhances the versatility of the proposed RL-IBVS.
This versatility is also validated in the results presented in
Fig. 7b. In this figure, it can be noticed the stable trajectory
performed by the RL-IBVS agent even in presence of sudden
movements of the car and false positives in the detection of
the target due to reflections on the metallic surface of the
car.
Despite the good capabilities exhibited by the proposed
RL-IBVS approach, it has to be noted the difficult process of
designing the reward function in order to train the agent. We
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.
0 20 40 60
Time (s)
-200
-100
0
100
200
300
Err
or
xy (
pix
els
)
ex
ey
(a)
0 20 40 60
Time (s)
-40
-20
0
20
40
Err
or
(pix
els
)
(b)
0
1
2
2
Z (
m)
3
Y (m)0
X (m)
2-2 0-2-4
Follower
Leader
0
1
2
3
4
5
(c)
0 20 40 60
Time (s)
-200
-100
0
100
200
300
Err
or
xy (
pix
els
)
ex
ey
(d)
0 20 40 60
Time (s)
-40
-20
0
20
40
Err
or
(pix
els
)
(e)
02
1
2
Z (
m)
3
Y (m) 0
2
X (m)
0-2 -2-4
Follower
Leader
0
1
2
3
4
5
(f)
0 20 40 60
Time (s)
-200
-100
0
100
200
300
Err
or
xy (
pix
els
)
ex
ey
(g)
0 20 40 60
Time (s)
-40
-20
0
20
40
Err
or
(pix
els
)
(h)
0
1
2
2
Z (
m)
3
Y (m) 0
X (m)
2-2 0-2-4
Follower
Leader
0
1
2
3
4
5
(i)
Fig. 6: Real flight experiment for a leader-follower configuration. The leader is characterized by a human carrying a target
object. The follower consists in a Parrot Bebop 2 quadrotor. (a), (b), (c) Results obtained with the classic IBVS controller.
(d), (e), (f) Results acquired using the partitioned IBVS. (g), (h), (i) Results obtained for the proposed RL-IBVS approach.
have found the success of training the agent very dependent
on the reward function design, and thus will be considered
as a future research topic. Moreover, we have found the
stabilization of the image as a crucial technique in order
to train and test the agent. Training the agent using directly
the actual detection of the target in the image plane, led to
learning unsuccessful policies.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a novel Image-Based Visual
Servoing approach using Deep Reinforcement Learning (RL-
IBVS) for controlling a multirotor aerial robotic platform
using only the errors computed on the image provided
by a bottom-looking camera. The proposed approach has
been extensively compared with two state-of-the-art IBVS
techniques, revealing the good capabilities of the RL-IBVS
in several simulated and real flight scenarios. One of the
key properties of the RL-IBVS is the quick and almost
direct transition from simulation to real flight. Moreover,
this property is enhanced by the fact that the RL-IBVS was
trained only in simulation using a quadrotor aerial robot
(AscTec Hummingbird) with different dynamics as compared
to the one used in real flights (Parrot Bebop 2). Future
work is aimed towards extending the capabilities of the RL-
IBVS in order to directly control the quadrotor in attitude
commands (roll, pitch and yaw). In addition, other deep
reinforcement learning approaches will be explored.
ACKNOWLEDGMENT
The authors would like to thank the Spanish Ministry of
Science MICYT DPI2014-60139-R for project funding. The
LAL UPM and the MONCLOA Campus of International
Excellence are also acknowledged for the funding of the
predoctoral contract of one of the authors.
REFERENCES
[1] J. Pestana, J. L. Sanchez-Lopez, P. Campoy, and S. Saripalli, “Visionbased gps-denied object tracking and following for unmanned aerialvehicles,” in Safety, security, and rescue robotics (ssrr), 2013 ieee
international symposium on. IEEE, 2013, pp. 1–6.[2] J. Thomas, J. Welde, G. Loianno, K. Daniilidis, and V. Kumar,
“Autonomous flight for detection, localization, and tracking of movingtargets with a small quadrotor,” IEEE Robotics and Automation Letters,vol. 2, no. 3, pp. 1762–1769, 2017.
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.
(a)
08
-46
-24
Y (m)X (m)
02
2 0
-24
1
Z (
m) 2
3
A
A
B
B
(b)
Fig. 7: Real flight experiment with a remotely controlled car
as the object to follow. (a) Indoor scenario created for the
experiment. (b) Trajectory generated by the aerial robot (red
color) while following the car (green cuboid). The arrows
depict the movement of the car.
[3] S. Li, T. Liu, C. Zhang, D.-Y. Yeung, and S. Shen, “Learningunmanned aerial vehicle control for autonomous target following,”arXiv preprint arXiv:1709.08233, 2017.
[4] J. Thomas, G. Loianno, K. Sreenath, and V. Kumar, “Toward imagebased visual servoing for aerial grasping and perching,” in Robotics
and Automation (ICRA), 2014 IEEE International Conference on.IEEE, 2014, pp. 2113–2118.
[5] V. Lippiello, J. Cacace, A. Santamaria-Navarro, J. Andrade-Cetto,M. A. Trujillo, Y. R. Esteves, and A. Viguria, “Hybrid visual servoingwith hierarchical task composition for aerial manipulation,” IEEE
Robotics and Automation Letters, vol. 1, no. 1, pp. 259–266, 2016.
[6] D. Lee, T. Ryan, and H. J. Kim, “Autonomous landing of a vtoluav on a moving platform using image-based visual servoing,” inRobotics and Automation (ICRA), 2012 IEEE International Conference
on. IEEE, 2012, pp. 971–976.
[7] N. Metni and T. Hamel, “A uav for bridge inspection: Visual servoingcontrol law with orientation limits,” Automation in construction,vol. 17, no. 1, pp. 3–10, 2007.
[8] I. Sa, S. Hrabar, and P. Corke, “Inspection of pole-like structuresusing a vision-controlled vtol uav and shared autonomy,” in Intelli-
gent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International
Conference on. IEEE, 2014, pp. 4819–4826.
[9] O. Araar and N. Aouf, “Visual servoing of a quadrotor uav for au-tonomous power lines inspection,” in Control and Automation (MED),
2014 22nd Mediterranean Conference of. IEEE, 2014.
[10] B. Espiau, “Effect of camera calibration errors on visual servoing inrobotics,” in Experimental robotics III. Springer, 1994, pp. 182–192.
[11] F. Chaumette, “Potential problems of stability and convergence inimage-based and position-based visual servoing,” in The confluence
of vision and control. Springer, 1998, pp. 66–78.[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce-ment learning,” arXiv preprint arXiv:1509.02971, 2015.
[13] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, “Rotors — amodular gazebo mav simulator framework,” Robot Operating System
(ROS), vol. 625, p. 595, 2016.[14] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of
reinforcement learning to aerobatic helicopter flight,” in Advances in
neural information processing systems, 2007, pp. 1–8.[15] A. X. Lee, S. Levine, and P. Abbeel, “Learning visual servoing with
deep features and fitted q-iteration,” CoRR, vol. abs/1703.11000, 2017.[16] Q. Bateux, E. Marchand, J. Leitner, F. Chaumette, and P. Corke,
“Visual servoing from deep neural networks,” in ”New Frontiers in
Deep Learning in Robotics” Workshop at the Robotics: Systems and
Science (RSS 2017) conference, Boston, MA, July 2017.[17] S. Hutchinson, G. Hager, and P. Corke, “A tutorial on visual servo
control,” Robotics and Automation, IEEE Transactions on, vol. 12,no. 5, pp. 651 –670, Oct. 1996.
[18] A. Saxena, H. Pandya, G. Kumar, A. Gaud, and K. M. Krishna,“Exploring convolutional networks for end-to-end visual servoing,” inRobotics and Automation (ICRA), 2017 IEEE International Conference
on. IEEE, 2017, pp. 3817–3823.[19] N. Imanberdiyev, C. Fu, E. Kayacan, and I. M. Chen, “Autonomous
navigation of uav by using real-time model-based reinforcement learn-ing,” in 2016 14th International Conference on Control, Automation,
Robotics and Vision (ICARCV), Nov 2016, pp. 1–6.[20] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, “Control of a quadrotor
with reinforcement learning,” IEEE Robotics and Automation Letters,vol. 2, no. 4, pp. 2096–2103, 2017.
[21] S. Choi, S. Kim, and H. J. Kim, “Inverse reinforcement learningcontrol for trajectory tracking of a multirotor uav,” International
Journal of Control, Automation and Systems, vol. 15, no. 4, 2017.[22] R. Polvara, M. Patacchiola, S. Sharma, J. Wan, A. Manning, R. Sut-
ton, and A. Cangelosi, “Autonomous quadrotor landing using deepreinforcement learning,” arXiv preprint arXiv:1709.03339, 2017.
[23] H. Shi, X. Li, K. S. Hwang, W. Pan, and G. Xu, “Decoupled visualservoing with fuzzy q-learning,” IEEE Transactions on Industrial
Informatics, vol. 14, no. 1, pp. 241–252, Jan 2018.[24] E. Malis, “Improving vision-based control using efficient second-
order minimization techniques,” in Robotics and Automation, 2004.
Proceedings. ICRA ’04. 2004 IEEE International Conference on,vol. 2, April 2004, pp. 1843–1848 Vol.2.
[25] P. I. Corke and S. A. Hutchinson, “A new partitioned approach toimage-based visual servo control,” IEEE Transactions on Robotics and
Automation, vol. 17, no. 4, pp. 507–515, Aug 2001.[26] F. Chaumette and S. Hutchinson, “Visual servo control. ii. advanced
approaches [tutorial],” Robotics Automation Magazine, IEEE, vol. 14,no. 1, pp. 109 –118, march 2007.
[27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-miller, “Deterministic policy gradient algorithms,” in Proceedings of
the 31st International Conference on Machine Learning (ICML-14),2014, pp. 387–395.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,et al., “Human-level control through deep reinforcement learning,”Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[29] M. Quigley, J. Faust, T. Foote, and J. Leibs, “Ros: an open-sourcerobot operating system.”
[30] J. Pestana, I. Mellado-Bataller, J. L. Sanchez-Lopez, C. Fu, I. F. Mon-dragon, and P. Campoy, “A general purpose configurable controller forindoors and outdoors gps-denied navigation for multirotor unmannedaerial vehicles,” Journal of Intelligent & Robotic Systems, vol. 73, no.1-4, pp. 387–400, 2014.
[31] M. Dorigo and M. Colombetti, Robot shaping: an experiment in
behavior engineering. MIT press, 1998.[32] J. L. Sanchez-Lopez, M. Molina, H. Bavle, C. Sampedro, R. A. S.
Fernandez, and P. Campoy, “A multi-layered component-based ap-proach for the development of aerial robotic systems: The aerostackframework,” Journal of Intelligent & Robotic Systems, vol. 88, no. 2-4,pp. 683–709, 2017.
CONFIDENTIAL. Limited circulation. For review only.
Preprint submitted to 2018 IEEE/RSJ International Conferenceon Intelligent Robots and Systems. Received March 9, 2018.