
This may be the author’s version of a work that was submitted/accepted for publication in the following source:

Sampedro, Carlos, Rodriguez-Ramos, Alejandro, Gil, Ignacio, Mejias Alvarez, Luis, & Campoy, Pascual (2018) Image-based visual servoing controller for multirotor aerial robots using deep reinforcement learning. In Laschi, C (Ed.) Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, United States of America, pp. 979-986.

This file was downloaded from: https://eprints.qut.edu.au/120147/

© Consult author(s) regarding copyright matters

This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons License (or other specified license) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to [email protected]

Notice: Please note that this document may not be the Version of Record (i.e. published version) of the work. Author manuscript versions (as Submitted for peer review or as Accepted for publication after peer review) can be identified by an absence of publisher branding and/or typeset appearance. If there is any doubt, please refer to the published source.

https://doi.org/10.1109/IROS.2018.8594249


Image-Based Visual Servoing Controller for Multirotor Aerial Robots Using Deep Reinforcement Learning

Carlos Sampedro1, Alejandro Rodriguez-Ramos1, Ignacio Gil1, Luis Mejias2 and Pascual Campoy1

Abstract— In this paper, we propose a novel Image-Based Visual Servoing (IBVS) controller for multirotor aerial robots based on a recent deep reinforcement learning algorithm named Deep Deterministic Policy Gradients (DDPG). The proposed RL-IBVS controller is successfully trained in a Gazebo-based simulated environment in order to learn the appropriate IBVS policy for directly mapping a state, based on errors in the image, to the linear velocity commands of the aerial robot. A thorough validation of the proposed controller has been conducted in simulated and real flight scenarios, demonstrating outstanding capabilities in object following applications. Moreover, we conduct a detailed comparison of the RL-IBVS controller with respect to classic and partitioned IBVS approaches.

I. INTRODUCTION

Visual servoing is a mature technique, which has been extensively studied in the literature over the years. Despite many efforts, the classic approach of mapping visual errors to actuator commands still has limitations in terms of convergence, stability and gain design. In the field of aerial robotics, visual servoing has been used in a wide range of applications such as object following [1], [2], [3], object grasping and manipulation [4], [5], landing [6], and inspection tasks [7], [8], [9], among others. In many cases, success has been achieved by linearizing the states around an operating point (i.e., hover or near hover).

In their two variants, i.e., Position-Based Visual Servoing (PBVS) and Image-Based Visual Servoing (IBVS), traditional visual servoing strategies have achieved acceptable performance; however, notable convergence and stability problems [10], [11] can appear if the task is performed in a zone away from the operating point (or linearization point). Some of these problems are related to the calculation of the image Jacobian, also called the interaction matrix, which plays a crucial role in the implementation of the control law, but also to the choice of an appropriate gain that provides a good tradeoff between convergence and stability. Other important problems are those related to local minima arising from unrealizable image motions. Moreover, tuning the different parameters of the control model can become an onerous procedure in most applications.

Inspired by the aforementioned limitations, in this paper we propose a novel IBVS controller based on a recent deep reinforcement learning algorithm named Deep Deterministic Policy Gradients (DDPG) [12]. Taking as input the errors in the image plane with respect to a predefined reference, the agent is successfully trained to generate the appropriate linear velocity commands in the x, y and z directions for controlling the aerial robot. The training process of the agent has been conducted using our own reinforcement learning framework, which integrates recent deep reinforcement learning algorithms with an aerial robotics simulator (i.e. RotorS Gazebo [13]). In order to infer velocity commands, the agent's state is composed of the errors in the image plane and their respective derivatives. Furthermore, the reward function is carefully designed in order to obtain a smooth behavior of the agent while performing object following missions. These missions have been conducted in simulated and real flight experiments, demonstrating the appropriate capabilities of the proposed RL-IBVS algorithm.

1 Computer Vision and Aerial Robotics group, Centre for Automation and Robotics, Universidad Politecnica de Madrid (UPM-CSIC), Calle Jose Gutierrez Abascal 2, Madrid (Spain). [email protected]

2 School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane QLD 4000, Australia.

The main contributions of the presented work are summarized here:

• A novel IBVS controller for multirotor aerial robots based on deep reinforcement learning is proposed. The RL-IBVS controller is completely trained in simulation and tested in simulated and real scenarios.

• An extensive validation and comparison of the proposed RL-IBVS approach with respect to classic IBVS controllers are conducted in simulated and real flights.

The remainder of this paper is structured as follows: Section II provides an overview of the related work, and Section III introduces the formulation of the proposed RL-IBVS controller. The experiments conducted in simulated and real scenarios are presented in Section IV. Results are discussed in Section V, and finally Section VI concludes the paper and points out future research directions.

II. RELATED WORK

Recent literature on visual servoing and machine learning has focused on learning features and predictive models, under the impression that traditional visual servoing has already directed most of its efforts at designing stable and convergent controllers (assuming such features are available). We take the opposite view: nowadays there is a suite of robust descriptors and detectors that can work in challenging lighting conditions, so the remaining problem is designing controllers that are robust and stable in a large task space (not just around the linearization point). The problem becomes amplified when we consider aerial platforms such as quadrotors. One of the pioneering works on the use of reinforcement learning for aerial robots is [14], where a reinforcement learning (optimal control) algorithm was used to find a controller that is optimized for specific aerial maneuvers. Recent efforts on deep learning and visual servoing include Lee et al. [15], in which a Q-iteration algorithm is used to learn which features are best suited for the task at hand; however, their optimization function is ultimately solved efficiently by linearizing the dynamics around an operating point. Convolutional neural networks (CNNs) have been popular in many areas of robotics, science and engineering. For instance, Bateux et al. [16] have addressed the visual servoing problem by training a CNN to estimate the relative pose between the current and reference image; given this relative translation and rotation, a classic visual servoing control law is then used [17]. Saxena et al. [18] approach the problem using CNNs as well; however, additional experimentation is conducted on a quadrotor using a standard servoing control approach.

In terms of control and navigation, reinforcement learning has also been used to learn navigation routines for a quadrotor [19], control and stabilization [20], target tracking [21] and landing [22], without explicitly addressing the visual servoing task with machine learning. The closest work to the one presented here is Shi et al. [23], in which two gains (translation and rotation) are learned to position a quadrotor relative to a target using fuzzy Q-learning. In the previous literature, little attention has been paid to learning gains for robustness and stability. This is a standing issue in many decoupled IBVS approaches for aerial robotics.

III. IMAGE-BASED VISUAL SERVOING CONTROLLERS

A. Classic and Partitioned IBVS Controllers

The use of IBVS approaches with aerial robots is a challenging task. Classic IBVS as proposed in the literature [17] establishes that the control law $\nu = -\lambda L_s^+ e$ will drive the linear and angular velocity references to zero as the features $s$ move towards their desired locations $s^*$, with the error defined as $e = s - s^*$. This approach requires a minimum of four points, but due to the estimation of varying 3-D parameters during the visual servo, the term $L_s^+$ can be difficult to estimate or approximate. An alternative approach detailed in [24] proposes $L_s^+ = \frac{1}{2}(L_s + L_s^*)^+$, where $L_s^*$ is the interaction matrix at the desired feature location; better convergence capabilities than classic IBVS were proven in that work. In this paper, we use the approach described in [24] as one of the baseline techniques (referred to as classic IBVS) to compare the performance of our proposed RL-IBVS. Furthermore, we enhance the classic IBVS by introducing an adaptive gain $\lambda(x) = a\exp(-bx) + c$, where $a = \lambda(0) - \lambda(\infty)$, $b = \lambda(0)/a$ and $c = \lambda(\infty)$ are constant parameters and $x$ is the infinity norm of the task Jacobian.
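To make this baseline concrete, the following is a minimal numpy sketch of the ESM-style classic IBVS law with the adaptive gain, assuming point features in normalized image coordinates with known depths. The interaction-matrix rows follow the standard point-feature form from [17]; the function names and gain values are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def point_interaction_matrix(x, y, Z):
    """2x6 interaction matrix of a normalized image point (x, y) at depth Z,
    as given in the classic visual servoing literature [17]."""
    return np.array([
        [-1.0 / Z, 0.0,      x / Z, x * y,       -(1.0 + x * x), y],
        [0.0,     -1.0 / Z, y / Z, 1.0 + y * y, -x * y,        -x],
    ])

def adaptive_gain(x, lam_0=2.0, lam_inf=0.5):
    """Adaptive gain lambda(x) = a*exp(-b*x) + c with a = lambda(0) - lambda(inf),
    b = lambda(0)/a and c = lambda(inf); the numeric values are placeholders."""
    a = lam_0 - lam_inf
    b = lam_0 / a
    c = lam_inf
    return a * np.exp(-b * x) + c

def classic_ibvs(features, desired, depths, desired_depths):
    """ESM-style classic IBVS [24]: nu = -lambda * (0.5 * (Ls + Ls*))^+ * e,
    with e the stacked feature error. Inputs are (N, 2) arrays of normalized
    coordinates and length-N arrays of depths."""
    e = (features - desired).reshape(-1)
    Ls = np.vstack([point_interaction_matrix(x, y, Z)
                    for (x, y), Z in zip(features, depths)])
    Ls_star = np.vstack([point_interaction_matrix(x, y, Z)
                         for (x, y), Z in zip(desired, desired_depths)])
    L = 0.5 * (Ls + Ls_star)
    lam = adaptive_gain(np.linalg.norm(L, np.inf))  # x: infinity norm of the task Jacobian
    return -lam * np.linalg.pinv(L) @ e             # camera twist (vx, vy, vz, wx, wy, wz)
```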

For aerial vehicles, a common way to deal with underactuated systems such as quadrotors is to decouple the visual servoing task [25]. A decoupled IBVS approach aims at separating translational and rotational degrees of freedom in the control task. A detailed overview of this technique is presented in [25], [26]. Inspired by the decoupled approaches, the partitioned IBVS emerges, which exhibits similar decoupling properties but uses only features expressed in the image plane. The formulation of the partitioned IBVS is based on decoupling the motion related to the z axis from the motion along the x and y axes. For this purpose, the interaction matrix is partitioned such that:

$$\dot{s} = L_s\nu \quad (1)$$
$$= L_{xy}\nu_{xy} + L_z\nu_z \quad (2)$$
$$= \dot{s}_{xy} + \dot{s}_z \quad (3)$$

where $L_{xy}$ = columns $\{1, 2, 4, 5\}$ of $L_s$, $L_z$ = columns $\{3, 6\}$ of $L_s$, $\nu_{xy} = (v_x, v_y, \omega_x, \omega_y)$ and $\nu_z = (v_z, \omega_z)$. $\nu = (v, \omega) = (v_x, v_y, v_z, \omega_x, \omega_y, \omega_z) \in \mathbb{R}^6$ is the vector containing the linear and angular velocities of the camera, and $\dot{s}$ are the feature velocities in the image, whose coordinates are defined by $s = (u, v)$ with desired values $s^* = (u^*, v^*)$. Then, the control law ($\nu = -\lambda L^+ e$), with $e_i = s_i - s_i^*$ and $\lambda$ a positive gain, becomes:

$$\nu_{xy} = -L_{xy}^+(\lambda e + L_z\nu_z) \quad (4)$$

Assuming an operating point (linearization point, near hover), (4) can be simplified, obtaining linear and angular velocity commands that can be sent separately and simultaneously to the quadrotor in order to achieve translation and rotation. For instance, rotation can be controlled by $\omega_z = \lambda_{\omega_z}(\psi_t^* - \psi_t)$, where $\psi_t$ and $\psi_t^*$ are the orientation of the target in the image and its desired value, respectively. Altitude can be controlled by $v_z = \lambda_{v_z}\ln\frac{\sigma^*}{\sigma}$, where $\sigma$ and $\sigma^*$ are the area of the polygon enclosed by the set of features $s$ and its desired value. In this work, we use the partitioned IBVS approach as another baseline technique in order to evaluate our RL-IBVS.
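For illustration, a short numpy sketch of this partitioned baseline is given below. Computing $\sigma$ via the shoelace area of the feature polygon and the unit gain values are our own choices; they are not specified in the text.

```python
import numpy as np

def polygon_area(points):
    """Shoelace formula for the area sigma of the polygon enclosed by the
    image features; `points` is an (N, 2) array of (u, v) coordinates."""
    u, v = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(u, np.roll(v, 1)) - np.dot(v, np.roll(u, 1)))

def partitioned_z(points, psi, sigma_star, psi_star, lam_vz=1.0, lam_wz=1.0):
    """Decoupled z-axis laws of the partitioned IBVS:
    v_z = lambda_vz * ln(sigma*/sigma) and w_z = lambda_wz * (psi* - psi)."""
    sigma = polygon_area(points)
    v_z = lam_vz * np.log(sigma_star / sigma)
    w_z = lam_wz * (psi_star - psi)
    return np.array([v_z, w_z])

def partitioned_xy(L_s, e, nu_z, lam=1.0):
    """Remaining degrees of freedom, Eq. (4): nu_xy = -Lxy^+ (lambda*e + Lz*nu_z),
    where Lxy are columns {1, 2, 4, 5} and Lz columns {3, 6} of Ls (1-indexed)."""
    L_xy = L_s[:, [0, 1, 3, 4]]
    L_z = L_s[:, [2, 5]]
    return -np.linalg.pinv(L_xy) @ (lam * e + L_z @ nu_z)
```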

B. Reinforcement Learning IBVS Controller

1) Background: The principal components in a reinforce-

ment learning problem are the agent and the environment.

The agent interacts at each time step with the environment

by executing an action and receiving an observation and a

reward from the latter. The final goal of the agent is to

learn the optimal policy in order to maximize the cumulative

future reward: Rt =∑T

i=t γi−tr(si, ai) given the reward

function r(si, ai), state si, the action ai, and a discount

factor γ ∈ [0, 1]. In reinforcement learning, the policy is

traditionally evaluated using the Bellman equation (5), which

computes the expected return once the agent executes an

action at at state st and then follows policy π(a|s) thereafter.

Qπ(st, at) = Eπ[Rt|st, at] (5)

where Qπ(st, at) is the action-value function.
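As a minimal sketch of the return definition above, the discounted return for every step of an episode can be computed with a single backward pass; the discount value is illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every step t of an
    episode by a single backward pass over the reward sequence."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three rewards with gamma = 0.9 -> R_0 = 1 + 0.9*(2 + 0.9*3) = 5.23
# print(discounted_returns([1.0, 2.0, 3.0], gamma=0.9))
```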

Fig. 1: Architecture of the proposed RL-IBVS system. The environment receives as input a ROI in the image plane computed by an object detector. The agent, which implements the DDPG algorithm, receives an observation and a reward from the environment and computes the action (linear velocity) to be commanded to the velocity controller. The dotted lines represent interactions between the components in training mode, while the continuous lines depict the interactions in test mode.

Recently, deep reinforcement learning methods have utilized neural networks to learn the policy as well as to approximate the action-value function [12], [27], [28]. In [28], two key techniques were introduced in order to efficiently use neural networks in the reinforcement learning paradigm: the experience replay, which helps to break the correlation between observations, and the use of target networks in order to reduce the correlation between the action-value function ($Q$) and the target values $r + \gamma\max_{a'}Q(s', a')$. The latter technique was improved in the DDPG algorithm, where a soft updating method was introduced to update the parameters of the target critic and target actor networks.

The DDPG algorithm is a model-free, off-policy algorithm based on an actor-critic architecture. The critic neural network acts as a non-linear function approximator for estimating the action-value function, while the neural network represented by the actor is used to compute the policy that directly maps a state to an action for the agent.
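The following framework-agnostic sketch illustrates the two target-network mechanisms mentioned above: the bootstrapped target value used to train the critic and the DDPG soft update. The callables `target_actor`/`target_critic` are assumed interfaces, and the value of tau is taken from the original DDPG paper [12], not from this work.

```python
def td_target(reward, next_state, done, target_actor, target_critic, gamma=0.99):
    """Target value y = r + gamma * Q'(s', mu'(s')) used to train the critic;
    terminal transitions bootstrap with the reward alone."""
    if done:
        return reward
    return reward + gamma * target_critic(next_state, target_actor(next_state))

def soft_update(target_params, online_params, tau=0.001):
    """DDPG soft target update: theta_target <- tau * theta + (1 - tau) * theta_target.
    Parameters are dicts of numpy arrays; tau = 0.001 follows [12]."""
    for name, theta in online_params.items():
        target_params[name] = tau * theta + (1.0 - tau) * target_params[name]
    return target_params
```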

2) RL-IBVS Architecture: The architecture of the proposed RL-IBVS algorithm has been designed in order to integrate deep reinforcement learning agents with a Gazebo-based simulator (e.g. RotorS Gazebo). The Robot Operating System (ROS) [29] middleware is used to communicate between the agent and the environment (see Fig. 1).

As shown in Fig. 1, the environment receives a Region Of Interest (ROI) from an object detection algorithm and computes the observation at the current time step, which consists of a 6-dimensional vector defined by (6).

$$S = \{e_x, e_y, e_\Phi, de_x, de_y, de_\Phi\} \quad (6)$$

where $e_x$, $e_y$ represent the normalized error of the center of the detected ROI with respect to the reference defined in the image plane, and $e_\Phi$ is the error between the diameter of the circle circumscribing the ROI ($\Phi_o$) and the reference diameter ($\Phi_{ref}$) (see Fig. 2). The variables $de_x$, $de_y$, $de_\Phi$ are the normalized derivatives of the errors, filtered using a low-pass filter. It has to be remarked that these errors in the image plane are computed with respect to a virtual stabilized image plane. In order to compute this virtual plane, we apply the transformation in (7) to the actual ROI, taking into account the roll and pitch information provided by the Inertial Measurement Unit (IMU) of the aerial robot. This stabilization procedure has been found to be crucial in the learning process of the agent.

$$p' = KRK^{-1}p \quad (7)$$

where $p$ represents a point in the actual image plane, $p'$ is the point in the virtual stabilized image plane, $K$ is the projection matrix and $R$ is the rotation matrix which encapsulates the roll and pitch of the aerial robot frame of reference (U) expressed in the camera frame (C).
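A small sketch of this state computation is given below. The rotation axis convention used to build $R$ (roll about x, pitch about y in the camera frame) and the first-order low-pass filter constant are assumptions made for illustration; they are not specified in the paper.

```python
import numpy as np

def stabilize_point(p_uv, K, roll, pitch):
    """Map a pixel (u, v) from the actual image plane to the virtual,
    roll/pitch-compensated plane via p' = K R K^{-1} p (Eq. (7)),
    using homogeneous coordinates."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])   # roll (assumed axis)
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pitch (assumed axis)
    R = Ry @ Rx
    p = np.array([p_uv[0], p_uv[1], 1.0])
    p_virtual = K @ R @ np.linalg.inv(K) @ p
    return p_virtual[:2] / p_virtual[2]

def build_state(e, e_prev, de_prev, dt, alpha=0.7):
    """Assemble the 6-D state {ex, ey, ePhi, dex, dey, dePhi}: the normalized
    ROI errors e plus their derivatives, smoothed with a simple first-order
    low-pass filter (alpha is an illustrative filter constant)."""
    de_raw = (e - e_prev) / dt
    de = alpha * de_prev + (1.0 - alpha) * de_raw
    return np.concatenate([e, de]), de
```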

The reference in the image is computed by projecting the desired 3D points of the target object onto the image plane using the intrinsic camera parameters and considering a pinhole camera model. In this work, the desired 3D position of the robot with respect to the object is always considered centered with respect to the bottom camera frame (C) and at the desired height above the object.

The agent, which implements the DDPG algorithm, generates a 3-dimensional continuous action (8) computed by the actor network and receives a 6-dimensional observation (6) and a reward (10) from the environment.

$$A = \{a_x, a_y, a_z\} \quad (8)$$

where $a_x$, $a_y$, $a_z$ are the linear velocities commanded to the velocity controller [30], expressed in the world frame (W).

The actor and critic neural networks that compose the DDPG agent are feed-forward neural networks with two hidden layers of 400 and 300 units. The activation function utilized in each hidden layer is the Rectified Linear Unit (ReLU). The output layer of the actor network is composed of 3 units with a tanh activation function bounded to [−0.5, 0.5] in order to generate the actions commanded to the velocity controller. The output layer of the critic network consists of one unit with a linear activation function in order to estimate the Q-value function.
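The paper reports using TensorFlow; a modern tf.keras sketch of networks with these layer sizes and activations is given below. Where the action is injected into the critic (here simply concatenated with the state at the input) is our assumption, since the paper does not specify it, and the code is not the authors' original implementation.

```python
import tensorflow as tf

def build_actor(state_dim=6, action_dim=3, action_bound=0.5):
    """Actor: 400/300 ReLU hidden layers, tanh output scaled to [-0.5, 0.5]."""
    inputs = tf.keras.Input(shape=(state_dim,))
    x = tf.keras.layers.Dense(400, activation="relu")(inputs)
    x = tf.keras.layers.Dense(300, activation="relu")(x)
    raw = tf.keras.layers.Dense(action_dim, activation="tanh")(x)
    actions = tf.keras.layers.Lambda(lambda a: action_bound * a)(raw)
    return tf.keras.Model(inputs, actions)

def build_critic(state_dim=6, action_dim=3):
    """Critic: same hidden sizes, linear scalar output estimating Q(s, a)."""
    state_in = tf.keras.Input(shape=(state_dim,))
    action_in = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Concatenate()([state_in, action_in])  # assumed injection point
    x = tf.keras.layers.Dense(400, activation="relu")(x)
    x = tf.keras.layers.Dense(300, activation="relu")(x)
    q = tf.keras.layers.Dense(1, activation="linear")(x)
    return tf.keras.Model([state_in, action_in], q)
```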

3) Reward Function: The reward function implemented on the environment's side of the RL-IBVS architecture is designed to evaluate the performance of the agent over two consecutive time steps according to (10). For this purpose, a shaping function [31] is introduced. As can be seen in (9), the shaping term penalizes large values of the state components, that is, we penalize the agent when the errors computed in the image plane are large. Furthermore, we add an extra term to the shaping function which penalizes large values of the actions executed by the agent, with the purpose of obtaining a smoother behavior.

$$\mathrm{shaping}_t = -\alpha\sqrt{e_x^2 + e_y^2} - \beta\sqrt{de_x^2 + de_y^2} - \gamma|e_\Phi| - \delta|de_\Phi| - \epsilon\sqrt{a_x^2 + a_y^2 + a_z^2} \quad (9)$$

$$r_t = \mathrm{shaping}_t - \mathrm{shaping}_{t-1} \quad (10)$$

where $r_t$ is the reward obtained at time step $t$, and $\alpha$, $\beta$, $\gamma$, $\delta$ are positive gains that control the penalization of the errors in position and diameter in the image plane, together with their derivatives. $\epsilon$ controls the penalization of the action term. These gains have been empirically obtained, with $\alpha = 100$, $\beta = 10$, $\gamma = 100$, $\delta = 10$, and $\epsilon = 1$.
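The sketch below is a direct transcription of Eqs. (9)-(10) with the reported gains. Whether the shaping of the previous time step reuses the previous action is not stated in the paper, so the sketch assumes it does.

```python
import numpy as np

def shaping(state, action, alpha=100.0, beta=10.0, gamma_=100.0, delta=10.0, eps=1.0):
    """Shaping term of Eq. (9); state = (ex, ey, ePhi, dex, dey, dePhi)."""
    ex, ey, ephi, dex, dey, dephi = state
    return (-alpha * np.hypot(ex, ey)
            - beta * np.hypot(dex, dey)
            - gamma_ * abs(ephi)
            - delta * abs(dephi)
            - eps * np.linalg.norm(action))

def reward(state_t, action_t, state_tm1, action_tm1):
    """r_t = shaping_t - shaping_{t-1}, Eq. (10); the previous shaping is
    assumed to be evaluated with the previous state and action."""
    return shaping(state_t, action_t) - shaping(state_tm1, action_tm1)
```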

Fig. 2: Notation and reference frames utilized in this work, where W, C and U stand for the world, camera and aerial robot frames, respectively. The variables used for computing the state are also depicted in the image plane.

IV. EXPERIMENTS AND RESULTS

A video demonstration of the reported experiments and results is provided with this manuscript at: https://vimeo.com/258063350.

A. Experimental Setup

In order to perform simulated and real flight experiments, the proposed RL-IBVS architecture has been integrated within the Aerostack framework [32], which is built on top of the ROS middleware and provides the additional software components necessary to conduct autonomous flights (e.g. state estimator, velocity controller, etc.).

Two main simulation scenarios have been created using the RotorS Gazebo simulator. The first scenario, presented in Fig. 4a, is used for training the RL-IBVS agent and is composed of an AscTec Hummingbird quadrotor representing the robotic agent and a cylindrical object utilized as the target for visual servoing purposes. The Hummingbird quadrotor is equipped with a fisheye-lens bottom-looking camera with a horizontal field of view of 132° and an image size of 640 × 480 pixels. For training the RL-IBVS agent, the TensorFlow1 library is utilized on an Nvidia GeForce GTX 970 GPU under the Ubuntu 16.04 operating system. A second Gazebo simulation scenario (see Fig. 4b) has been created in order to test the RL-IBVS agent and compare it with the IBVS approaches presented in Section III-A.

Several real flight experiments have also been conducted in order to test the capabilities of the proposed IBVS approaches on a real aerial robotic platform. For this purpose, we use a Parrot Bebop 2 as the aerial robotic platform, equipped with a Parrot S.L.A.M. dunk (see Fig. 7a), utilized in this work only for obtaining velocity estimates, and a camera with an image size of 856 × 480 pixels. An OptiTrack motion capture system is used in the first real flight experiment only for gathering ground-truth data relative to the state of the robot and the target. In order to obtain a reliable comparison of the different IBVS approaches, in all the experiments we limit the commanded actions to the range [−0.5, 0.5] m/s and feed them to the same velocity controller [30].

B. Training Methodology

Taking into account the aforementioned experimental setup and using the scenario presented in Fig. 4a, the RL-IBVS agent is trained in an episodic RL setting. In each episode, the agent is placed at a random position within an area of 2 m × 2 m × 1.7 m, while the target object remains static at the origin. In each time step, the agent, which runs at 20 Hz, executes an action with noise added according to an Ornstein-Uhlenbeck process, with the final objective of being completely centered on the target object at the desired altitude above it. The episode is considered finished when a terminal state is reached. This can occur due to three main events: the target object moves out of the image boundaries, the aerial robot leaves the altitude limits (< 0.7 m or > 1.7 m), or the maximum number of steps per episode is reached. When either of the first two events occurs, the agent receives a reward of -100. In the rest of the time steps, the agent is rewarded according to (10). The maximum number of steps per episode is 4000.
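The episode structure described above can be sketched as follows. The `env`/`agent` interfaces (reset, step, act, remember, train_step) and the Ornstein-Uhlenbeck parameters are hypothetical placeholders; only the action bound, control rate and step limit come from the text.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise; theta/sigma/dt values are illustrative."""
    def __init__(self, dim=3, theta=0.15, sigma=0.2, dt=0.05):
        self.dim, self.theta, self.sigma, self.dt = dim, theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(self.dim))
        self.x += dx
        return self.x

def run_episode(env, agent, max_steps=4000, action_bound=0.5):
    """One training episode: random start handled by env.reset(); terminal
    events (target out of image, altitude outside [0.7, 1.7] m) are assumed
    to be detected by the environment, which then applies the -100 penalty."""
    noise = OUNoise()
    state = env.reset()
    for _ in range(max_steps):
        action = np.clip(agent.act(state) + noise.sample(), -action_bound, action_bound)
        next_state, reward_t, done = env.step(action)   # agent runs at 20 Hz
        agent.remember(state, action, reward_t, next_state, done)
        agent.train_step()
        state = next_state
        if done:
            break
```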

1 https://www.tensorflow.org/


Fig. 3: Training curves. (a) Accumulated reward per episode. (b) Accumulated action-value function (Q) per episode. Each episode is composed of 4000 simulation steps.

Fig. 4: Gazebo simulation environments. (a) Environment created for training the RL-IBVS agent. (b) Environment designed for evaluating the IBVS approaches using a leader-follower configuration.

During the training process of the actor's and critic's neural networks, the Adam optimizer has been utilized with a base learning rate of $10^{-4}$ for the actor and $10^{-3}$ for the critic, and a minibatch size of 64. Using this configuration, the training process of the RL-IBVS agent took approximately 17 hours, learning a stable behavior around episode 1400, as can be seen in the training curves presented in Fig. 3. In this figure, we only show the behavior of the agent from episode 600 onwards for clarity. From episode 0 to 800, the agent continuously gives very high positive and negative commands in altitude, reaching a terminal state at each time step. This causes the Q-value curve to stay around zero until episode 800.
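For completeness, the following is a tf.keras-style sketch of one DDPG update using the reported optimizer and learning rates; the structure of the training step reflects a standard DDPG implementation rather than the authors' original code, and the minibatch is assumed to contain float tensors of appropriate shapes.

```python
import tensorflow as tf

actor_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)   # actor base learning rate
critic_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)  # critic base learning rate

def ddpg_update(batch, actor, critic, target_actor, target_critic, gamma=0.99):
    """One DDPG update on a minibatch of 64 transitions (s, a, r, s2, done)."""
    s, a, r, s2, done = batch
    # Critic: regress Q(s, a) towards y = r + gamma * Q'(s2, mu'(s2)).
    y = r + gamma * (1.0 - done) * target_critic([s2, target_actor(s2)])
    with tf.GradientTape() as tape:
        q = critic([s, a])
        critic_loss = tf.reduce_mean(tf.square(y - q))
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))
    # Actor: ascend the critic's estimate of Q(s, mu(s)).
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))
```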

C. Simulation Experiments

The aim of this section is to validate the proposed RL-IBVS controller and compare it with the IBVS approaches presented in Section III-A. For this purpose, a leader-follower experiment using two aerial robots has been designed in the RotorS Gazebo simulator (see Fig. 4b). In the proposed experiment, an AscTec Firefly quadrotor is used as the leader and an AscTec Hummingbird acts as the follower. The leader performs a rhomboidal trajectory (top view), with an ascending or descending movement depending on the side of the rhomboid (see Figures 5c, 5f and 5i), at v = 0.33 m/s, and hovers during several simulation steps at its corners in order to evaluate the stabilization of each IBVS controller. The results obtained during the execution of the simulation experiment are shown in Fig. 5 and summarized in Table I, which shows the mean and standard deviation of each error variable.

TABLE I: Mean and standard deviation errors in pixels obtained for the simulation flight experiment of Fig. 5.

IBVS approach    ex             ey             eΦ
Classic          1.6 ± 24.2     −2.5 ± 27.0    1.4 ± 9.1
Partitioned      −0.2 ± 21.2    −1.2 ± 27.1    0.6 ± 6.2
RL               5.4 ± 40.2     2.0 ± 25.1     −0.03 ± 15.7

D. Real Flight Experiments

Two main real flight experiments have been designed in order to evaluate the capabilities of the IBVS approaches presented in this work onboard a real aerial robotic platform. The first real flight experiment has been designed in a structured 4 m × 2 m indoor scenario in which we recreate the leader-follower configuration of the simulation experiment. In this experiment, the quadrotor leader has been substituted by a human operator carrying a target marker in order to obtain a smoother trajectory of the target and improve the comparative analysis. The human operator performs a continuous rectangular trajectory (top view) with an ascending and descending movement on each long side of the rectangle (see Figures 6c, 6f and 6i). We have considered this configuration the most suitable due to the strong turbulence experienced when flying two small quadrotors one on top of the other. The results obtained during the execution of the experiment are shown in Fig. 6. A summary of the mean errors with their respective standard deviations for the experiment of Fig. 6 is shown in Table II.

TABLE II: Mean and standard deviation errors in pixels obtained for the real flight experiment of Fig. 6.

IBVS approach    ex              ey             eΦ
Classic          19.9 ± 94.7     15.8 ± 72.9    −0.3 ± 7.3
Partitioned      −2.2 ± 66.6     12.8 ± 58.1    −1.8 ± 5.7
RL               −10.9 ± 52.5    2.5 ± 23.0     −5.2 ± 13.2


Fig. 5: Simulation flight experiment for a leader-follower configuration. The leader executes a rhomboidal trajectory (top view), hovering during some simulation steps at each corner. (a), (b), (c) Results obtained with the classic IBVS controller. (d), (e), (f) Results acquired using the partitioned IBVS. (g), (h), (i) Results obtained for the proposed RL-IBVS approach.

The second real flight experiment has been conceived as a visual servoing experiment using a moving car as the object to follow, which performs an oval trajectory in an indoor scenario of 8 m × 8 m (see Fig. 7). The aim of this experiment is to evaluate the capabilities of the RL-IBVS approach while performing visual servoing on top of a moving car that can exhibit changes in its trajectory. The moving car is remotely controlled by a human operator, who changes the forward direction of the car when it moves along the y axis (see Fig. 7b). Furthermore, several pillars are placed in the scenario in order to hinder the movement of the aerial robot.

V. DISCUSSION

The experiments and results presented in simulated and real indoor scenarios prove the good behavior exhibited by the three IBVS controllers presented in this work. Remarkable results are obtained in the case of the partitioned IBVS and the proposed RL-IBVS, with mean errors of less than 6 pixels in the simulation flight experiment of Fig. 5 and 13 pixels in the real flight experiment presented in Fig. 6. More notable is the transition of the RL-IBVS from simulation to real flight tests. Unlike the classic and partitioned IBVS approaches, in which a considerable effort was made in tuning the parameters, the RL-IBVS exhibited a very quick and almost direct transition to real flight experiments, as evidenced by the results presented in Tables I and II. In both tables, the errors obtained for the RL-IBVS are within the same range, demonstrating the good transition capabilities of the proposed approach. Moreover, these results are of great interest taking into account that the RL-IBVS agent was trained in simulation using an AscTec Hummingbird quadrotor and tested in real flights using a Parrot Bebop 2. This fact highlights the versatility of the proposed RL-IBVS. This versatility is also validated by the results presented in Fig. 7b, in which the trajectory performed by the RL-IBVS agent remains stable even in the presence of sudden movements of the car and false positives in the detection of the target due to reflections on its metallic surface.

Despite the good capabilities exhibited by the proposed RL-IBVS approach, the difficulty of designing a suitable reward function for training the agent has to be noted.


Fig. 6: Real flight experiment for a leader-follower configuration. The leader is characterized by a human carrying a target object. The follower is a Parrot Bebop 2 quadrotor. (a), (b), (c) Results obtained with the classic IBVS controller. (d), (e), (f) Results acquired using the partitioned IBVS. (g), (h), (i) Results obtained for the proposed RL-IBVS approach.

We have found the success of training the agent to be highly dependent on the design of the reward function, and this will therefore be considered as a future research topic. Moreover, we have found the stabilization of the image to be a crucial technique for both training and testing the agent. Training the agent directly on the actual detection of the target in the image plane led to learning unsuccessful policies.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a novel Image-Based Visual Servoing approach using Deep Reinforcement Learning (RL-IBVS) for controlling a multirotor aerial robotic platform using only the errors computed on the image provided by a bottom-looking camera. The proposed approach has been extensively compared with two state-of-the-art IBVS techniques, revealing the good capabilities of the RL-IBVS in several simulated and real flight scenarios. One of the key properties of the RL-IBVS is the quick and almost direct transition from simulation to real flight. Moreover, this property is reinforced by the fact that the RL-IBVS was trained only in simulation using a quadrotor aerial robot (AscTec Hummingbird) with dynamics different from those of the platform used in real flights (Parrot Bebop 2). Future work is aimed at extending the capabilities of the RL-IBVS in order to directly control the quadrotor with attitude commands (roll, pitch and yaw). In addition, other deep reinforcement learning approaches will be explored.

ACKNOWLEDGMENT

The authors would like to thank the Spanish Ministry of Science MICYT DPI2014-60139-R for project funding. The LAL UPM and the MONCLOA Campus of International Excellence are also acknowledged for funding the predoctoral contract of one of the authors.

REFERENCES

[1] J. Pestana, J. L. Sanchez-Lopez, P. Campoy, and S. Saripalli, "Vision-based GPS-denied object tracking and following for unmanned aerial vehicles," in Safety, Security, and Rescue Robotics (SSRR), 2013 IEEE International Symposium on. IEEE, 2013, pp. 1-6.
[2] J. Thomas, J. Welde, G. Loianno, K. Daniilidis, and V. Kumar, "Autonomous flight for detection, localization, and tracking of moving targets with a small quadrotor," IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1762-1769, 2017.


Fig. 7: Real flight experiment with a remotely controlled car as the object to follow. (a) Indoor scenario created for the experiment. (b) Trajectory generated by the aerial robot (red color) while following the car (green cuboid). The arrows depict the movement of the car.

[3] S. Li, T. Liu, C. Zhang, D.-Y. Yeung, and S. Shen, "Learning unmanned aerial vehicle control for autonomous target following," arXiv preprint arXiv:1709.08233, 2017.
[4] J. Thomas, G. Loianno, K. Sreenath, and V. Kumar, "Toward image based visual servoing for aerial grasping and perching," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 2113-2118.
[5] V. Lippiello, J. Cacace, A. Santamaria-Navarro, J. Andrade-Cetto, M. A. Trujillo, Y. R. Esteves, and A. Viguria, "Hybrid visual servoing with hierarchical task composition for aerial manipulation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 259-266, 2016.
[6] D. Lee, T. Ryan, and H. J. Kim, "Autonomous landing of a VTOL UAV on a moving platform using image-based visual servoing," in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 971-976.
[7] N. Metni and T. Hamel, "A UAV for bridge inspection: Visual servoing control law with orientation limits," Automation in Construction, vol. 17, no. 1, pp. 3-10, 2007.
[8] I. Sa, S. Hrabar, and P. Corke, "Inspection of pole-like structures using a vision-controlled VTOL UAV and shared autonomy," in Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on. IEEE, 2014, pp. 4819-4826.
[9] O. Araar and N. Aouf, "Visual servoing of a quadrotor UAV for autonomous power lines inspection," in Control and Automation (MED), 2014 22nd Mediterranean Conference of. IEEE, 2014.
[10] B. Espiau, "Effect of camera calibration errors on visual servoing in robotics," in Experimental Robotics III. Springer, 1994, pp. 182-192.
[11] F. Chaumette, "Potential problems of stability and convergence in image-based and position-based visual servoing," in The Confluence of Vision and Control. Springer, 1998, pp. 66-78.
[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[13] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, "RotorS — a modular Gazebo MAV simulator framework," Robot Operating System (ROS), vol. 625, p. 595, 2016.
[14] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, "An application of reinforcement learning to aerobatic helicopter flight," in Advances in Neural Information Processing Systems, 2007, pp. 1-8.
[15] A. X. Lee, S. Levine, and P. Abbeel, "Learning visual servoing with deep features and fitted Q-iteration," CoRR, vol. abs/1703.11000, 2017.
[16] Q. Bateux, E. Marchand, J. Leitner, F. Chaumette, and P. Corke, "Visual servoing from deep neural networks," in "New Frontiers in Deep Learning in Robotics" Workshop at the Robotics: Science and Systems (RSS 2017) conference, Boston, MA, July 2017.
[17] S. Hutchinson, G. Hager, and P. Corke, "A tutorial on visual servo control," IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651-670, Oct. 1996.
[18] A. Saxena, H. Pandya, G. Kumar, A. Gaud, and K. M. Krishna, "Exploring convolutional networks for end-to-end visual servoing," in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3817-3823.
[19] N. Imanberdiyev, C. Fu, E. Kayacan, and I. M. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning," in 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), Nov 2016, pp. 1-6.
[20] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, "Control of a quadrotor with reinforcement learning," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2096-2103, 2017.
[21] S. Choi, S. Kim, and H. J. Kim, "Inverse reinforcement learning control for trajectory tracking of a multirotor UAV," International Journal of Control, Automation and Systems, vol. 15, no. 4, 2017.
[22] R. Polvara, M. Patacchiola, S. Sharma, J. Wan, A. Manning, R. Sutton, and A. Cangelosi, "Autonomous quadrotor landing using deep reinforcement learning," arXiv preprint arXiv:1709.03339, 2017.
[23] H. Shi, X. Li, K. S. Hwang, W. Pan, and G. Xu, "Decoupled visual servoing with fuzzy Q-learning," IEEE Transactions on Industrial Informatics, vol. 14, no. 1, pp. 241-252, Jan 2018.
[24] E. Malis, "Improving vision-based control using efficient second-order minimization techniques," in Robotics and Automation, 2004. Proceedings. ICRA '04. 2004 IEEE International Conference on, vol. 2, April 2004, pp. 1843-1848.
[25] P. I. Corke and S. A. Hutchinson, "A new partitioned approach to image-based visual servo control," IEEE Transactions on Robotics and Automation, vol. 17, no. 4, pp. 507-515, Aug 2001.
[26] F. Chaumette and S. Hutchinson, "Visual servo control. II. Advanced approaches [tutorial]," IEEE Robotics & Automation Magazine, vol. 14, no. 1, pp. 109-118, March 2007.
[27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 387-395.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[29] M. Quigley, J. Faust, T. Foote, and J. Leibs, "ROS: an open-source robot operating system."
[30] J. Pestana, I. Mellado-Bataller, J. L. Sanchez-Lopez, C. Fu, I. F. Mondragon, and P. Campoy, "A general purpose configurable controller for indoors and outdoors GPS-denied navigation for multirotor unmanned aerial vehicles," Journal of Intelligent & Robotic Systems, vol. 73, no. 1-4, pp. 387-400, 2014.
[31] M. Dorigo and M. Colombetti, Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1998.
[32] J. L. Sanchez-Lopez, M. Molina, H. Bavle, C. Sampedro, R. A. S. Fernandez, and P. Campoy, "A multi-layered component-based approach for the development of aerial robotic systems: The Aerostack framework," Journal of Intelligent & Robotic Systems, vol. 88, no. 2-4, pp. 683-709, 2017.
