
VISION-BASED UAV POSE ESTIMATION

A Thesis Presented

by

Zahra Mohajerani

to

The Department of Electrical and Computer Engineering

In partial fulfillment of the requirements

for the degree of

Master of Science

in

Electrical Engineering

Northeastern University

Boston, Massachusetts

08/2008


Acknowledgements

I would like to express my gratitude to all who supported me and helped me throughout this study. I thank Ehsan Nourbakhsh for his invaluable assistance and for all his help with Linux and programming in C. I am also grateful to Fei Xiong, who helped me run the experiments over and over again. I thank Ali Nouri for his guidance, his encouragement, and all his help, which showed me how to think like a computer scientist, not an engineer. And finally, I thank my parents for always having faith and belief in me.


Contents

Acknowledgements

1 Abstract

2 Introduction
2.1 Motivation
2.2 Problem Description and Approach
2.2.1 Unmanned Aircraft Vehicles (UAV)
2.2.2 Pose
2.3 Pose Estimation - Related Works
2.3.1 Single Camera and Moire Pattern
2.3.2 Two Ground Cameras
2.3.3 Two Cameras: One Ground And One Onboard
2.3.4 Other Approaches

3 Essential Preliminaries
3.1 Markov Sequence
3.2 Conditional Mean and Variance in Gaussian Distribution
3.3 EKF
3.4 Monte Carlo Simulation
3.5 Bayes Theorem
3.5.1 DBN
3.6 Shi and Tomasi
3.7 Spatial Relationships
3.7.1 AT
3.7.2 Coordinate Frame Relationships
3.7.3 Stochastic Map
3.7.4 Updating The Map

4 SLAM
4.1 Abstract
4.2 Implementation
4.2.1 Notations
4.2.2 Observation Model
4.2.3 Motion Model
4.3 Probabilistic SLAM
4.3.1 Step one: Prediction (time update)
4.3.2 Step two: Correction (measurement update)
4.4 EKF-SLAM
4.4.1 Time Update (Predict)
4.4.2 Observation Update (Correct)
4.4.3 Advantages and Disadvantages

5 Experiments
5.1 MonoSLAM
5.1.1 Motion
5.2 Vicon
5.3 Experimental Results

6 Conclusions

7 Discussion and Future Works
7.1 Discussion
7.1.1 Which Feature to Add
7.1.2 Moving Objects
7.2 Future works

Bibliography


List of Tables

4.1 SLAM Notations

4.2 EKF-SLAM Notations


List of Figures

2.1 6 Degrees Of Freedom (6DOF) of an aircraft

2.2 A moire pattern generated by superposition of two layers consisting of parallel lines; the lines of the revealing layer are parallel to the lines of the base layer. Courtesy of [15]

2.3 Moire target used by Tournier for UAV pose estimation

2.4 Calculation of altitude and yaw angle

2.5 Using the color markers under the quadrotor and two cameras, onboard and on the ground, the pose of the UAV can be computed

2.6 Using accelerometers in the Wii and iPhone for motion analysis

2.7 GUI view of the MonoSLAM software

3.1 DBN for the SLAM problem. U_i: control input applied at time (i-1) to drive the system to state X(i); Z(i,j): observation of L_j made at time i

3.2 A sequence of approximate transformations. Line: AT; ellipse: AT's uncertainty; solid: gained by movement; dashed: gained by observation

3.3 Compounding, merging and reversal operations. Line: AT; ellipse: AT's uncertainty; solid: gained by movement; dashed: gained by observation; X_ij: AT from frame i to frame j

3.4 Map of the 3D world, each feature represented as a 6DOF vector (3 DOF position and 3 DOF orientation)

3.5 Updating the map


4.1 1: start the prediction step; 2: estimate new location; 3: add process noise; 4: start the measurement step, measure features; 5: update positions and uncertainties

4.2 The convergence in landmark uncertainty. The plot shows a time history of standard deviations of a set of landmark locations. A landmark is initially observed with uncertainty inherited from the robot location and observation. Over time, the standard deviations reduce monotonically to a lower bound. New landmarks are acquired during motion

4.3 Before and after closing the loop; this figure is from [19]

5.1 The evolution of the depth probability density from a uniform distribution to Gaussian; points are uniformly distributed over the image patch's normal line from 0.5 to 5 m. Courtesy of [12]

5.2 The camera poses in two different viewpoints looking at the same image patch with normal n. Courtesy of [12]

5.3 Camera motion model

5.4 Vicon camera

5.5 Vicon markers, from [3]

5.6 The output of MonoSLAM for a sample sequence, with four startup features. The map is built gradually by updating the visible features in each frame. Blue marks correspond to failed matches, red marks correspond to successful matches and yellow marks correspond to unused features in the current frame

5.7 MonoSLAM error in the X, Y and Z dimensions

5.8 UAV orientation, MonoSLAM and Vicon outputs

5.9 Comparing the output trajectories in 2D space, from Vicon and MonoSLAM


7.1 a: existing map of the room; 10-point stars are the features, darker colors correspond to high-quality features and lighter colors correspond to low-quality ones; 4-point pink stars are the selected landmarks. b: updated map, selecting two other landmarks by only taking into account the quality of features. c: updated map, selecting features by quality and distance from the existing features in the map


Chapter 1

Abstract

The popularity of Unmanned Aerial Vehicles (UAVs) has increased dramatically in the past few decades. The main challenge in operating a UAV is to find its position and orientation accurately and continuously during the flight, using only the onboard sensors (a camera, in our case). Many different approaches to UAV pose estimation exist in the literature; several of them, together with the problems they face in real-life applications, were studied in this research. We chose the real-time single-camera SLAM method MonoSLAM [12], because our robot was equipped with an onboard camera and MonoSLAM had proved its reliability in previous localization experiments [12], [10], [6], [11]. In our work, we obtained the ground truth with Vicon cameras. The experimental results suggest that this method can be reliably used in the control of indoor UAV navigation.


Chapter 2

Introduction

In our work, we consider the problem of pose estimation for a helicopter when the sensory information comes only from a single mounted camera. We show that a previously studied algorithm called MonoSLAM [12] can successfully solve this task, and we verify its output against ground truth obtained with Vicon cameras.

2.1 Motivation

The popularity of Unmanned Aerial Vehicles (UAVs) has increased dramatically in the past few decades. The main challenge in operating a UAV is to find its position and orientation accurately and continuously during the flight, using only the onboard sensors.

2.2 Problem Description and Approach

In order to control the UAV during any mission, we need to know its position and orientation so that the proper decision can be taken at each moment. In cases where a complete map of the environment is known ahead of time, this extra information can be used as leverage in solving the pose estimation problem. However, in most real-life applications the map is not available a priori. This imposes additional complexity on the localization problem. Here we can redefine the localization problem as the


ego-motion estimation of a robot moving through the world, while building a map of that world at the same time and updating it dynamically. This problem is known as the Simultaneous Localization and Mapping (SLAM) problem in the robotics community. As the robot's measurement sensor we chose the camera mounted on the helicopter, and we used Andrew Davison's MonoSLAM software to compute the pose of the UAV in real time. Finally, we checked the results against ground truth obtained with Vicon cameras. The experiments showed that the pose calculated by this algorithm is reliable enough to be used in the UAV control step.

2.2.1 Unmanned Aircraft Vehicles (UAV)

Among all aircraft, helicopters are particularly interesting in that they possess unique characteristics. Unlike fixed-wing aircraft, helicopters can land and take off vertically without the need for a runway, because their rotor blades revolve in the air, which lets them move vertically without having to move forward. Another interesting characteristic of helicopters is that they can hover for an extended period of time and handle low-airspeed conditions. Because of these unique properties, helicopters can be used to accomplish tasks that no other aircraft can.

Today, helicopters are used for transportation, construction, fire fighting, search and rescue, and many other tasks that require the special capabilities mentioned above [4].

UAVs are aircraft that operate without onboard pilots. They can be used in situations where a human's life would be at risk, such as fire fighting, search and rescue in unknown terrain, and all missions that are too dangerous for manned aircraft [wiki].

The focus of this research is to provide a way of controlling an unmanned quadrotor, with the hope of taking advantage of the good characteristics of both UAVs and helicopters.


Figure 2.1: 6 Degrees Of Freedom (6DOF) of an aircraft: (a) horizontal, (b) vertical, (c) sideways, (d) roll, (e) pitch, (f) yaw.

2.2.2 Pose

Degrees of freedom (DOF) are the set of independent displacements of an object in space that completely define its position and orientation during its movement. 6DOF, or pose, refers to motion in 3-dimensional space: three of these degrees are related to angular position (rotation about perpendicular axes, the attitude (φ, θ, ψ)), namely pitch, roll and yaw; the other three are translations along three perpendicular axes (the position (x, y, z)): moving backward/forward (longitudinal), up/down (vertical) and left/right (lateral). Figure 2.1 (from NASA: http://www.aviationsystemsdivision.arc.nasa.gov/) illustrates these 6DOF for an airplane.

2.3 Pose Estimation - Related Works

The first step toward successfully using a UAV is to find its position and attitude (pose) in 3D space accurately and quickly. This information is critical for controlling


the UAV. Because UAVs are mainly used in unknown environments for search and rescue, or for surveillance in cluttered environments, where the Global Positioning System (GPS) usually does not work, much research has recently been done on finding the pose using other devices such as cameras, laser range finders, and other sensors. A brief summary of these approaches and their positive and negative properties is presented in this section.

2.3.1 Single Camera and Moire Pattern


Figure 2.2: A moire pattern generated by superposition of two layers consisting of parallel lines; the lines of the revealing layer are parallel to the lines of the base layer. Courtesy of [15].

Glenn Tournier [36] used a single onboard camera as the quadrotor's sensor for pose estimation. He used the moire effect to estimate the quadrotor pose by building a target with two different types of moire patterns along two perpendicular directions and four colored LEDs on the corners, as indicated in Figure 2.3. The moire pattern in



Figure 2.3: Moire Target used by Tournier for UAV pose estimation

the target is generated by placing two similarly structured gratings on top of each other, separated by a small distance. Tournier computed the altitude of the camera by applying the principle of similar triangles to the LEDs around the target, the yaw by computing how parallel the lines EB and DC are to the vertical, and the remaining four parameters by using the moire effect that appears on the target as the camera moves.

Disadvantages

Although this approach showed good results in [16] and [36], it has some disadvantages when used in outdoor environments or in other real-life applications:

1. Moire effects are sensitive to changes in illumination and perform poorly in outdoor environments. Because the primary UAV uses in our research are surveillance, search and rescue, and firefighting, which all take place outdoors, the moire pattern is not a good choice.

2. The pose estimation here is highly dependent on the target; if the target is not visible to the camera even for a short period of time, the UAV will lose control due to the lack of sufficient knowledge about its pose. Also, if we want to fly



Figure 2.4: Calculation of Altitude and Yaw angle

at an altitude higher than a room ceiling in outdoor environments, the target and its gratings become too small and tracking them becomes impossible (unless we significantly increase the target size). So finding the pose while tracking a fixed target is not a good idea for outdoor use either.

2.3.2 Two Ground Cameras

Andrew Ng used two ground cameras in a stereo pair and a machine-learning approach for helicopter pose estimation and flight control. His projects were carried out with a real helicopter flying in outdoor environments and showed very good results (the experiment movies are available on his website, http://heli.stanford.edu/). Because it relies on two ground cameras, this approach cannot be used in unknown strategic environments for surveillance tasks.


2.3.3 Two Cameras: One Ground And One Onboard

Erdinc Altug and Camillo Taylor [5] also used two cameras. They placed a pan/tilt/zoom camera on the ground and another one on the quadrotor, looking down toward the ground camera. Their research goal was to find the quadrotor's position and attitude with respect to the ground camera rather than the quadrotor pose by itself. In order to find the camera pose, colored markers of different colors and known radius (2.5 cm in their experiment) were attached under each rotor, at the center of the onboard camera, and on top of the ground camera, as illustrated in Figure 2.5. By tracking the markers and computing their areas, the pose relative to the camera could be computed.

Disadvantages

1. To be usable in real-life situations, the markers must be very large to be recognizable from a distance, which makes the approach unsuitable for critical missions like surveillance.

2. There is no way to find the quadrotor pose when the ground camera is out of the quadrotor camera's field of view. This makes the approach very sensitive to any possible occlusion in outdoor environments.

3. Deploying a ground camera is not possible for some missions, such as fire fighting or search and rescue.

2.3.4 Other Approaches

Besides cameras, it is possible to use other sensors to equip the system with extra useful information and increase the localization precision. Using a gyroscope to find the attitude is a great choice when the weight limit is not an issue. Three accelerometers along three perpendicular axes can also be used to estimate position (as in the Nintendo Wii and iPhone, Figure 2.6). Finally, laser


Figure 2.5: Using the color markers under the quadrotor and two cameras, onboard and on the ground, the pose of the UAV can be computed.

range finders have also been used to improve the performance of pose estimation.

Since the quadrotor we planned to use in this research was equipped with an onboard camera, and since the weight limit did not allow us to carry extra sensors on board, we chose the single-camera approach and used MonoSLAM [12]. A brief sketch of MonoSLAM is presented here; in Chapter 4 the algorithm is explained in detail.

MonoSLAM

MonoSLAM is a camera-based approach to SLAM, first introduced by Andrew Davison [10]. Before this method was introduced, other sensors such as laser range finders were used for SLAM.

The camera mounted on the robot captures a series of pictures of the environment as the robot moves around, and high-quality features are detected inside them. These image patches are then used in the regular SLAM framework as discussed in [14] and [8]. Unlike other methods, MonoSLAM relies more on localization than on building a precise map of the room. It stores just enough features to find


Figure 2.6: Using accelerometers in the (a) Nintendo Wii and (b) iPhone for motion analysis.

the robot’s position, thus creating a very sparse map (only around 100 total features),

so the output is a very sparse map and the robot’s pose during its motion.

To decrease the uncertainty in the pose, Davison chose to initialize the map in the first frame with four known features (by giving their appearances and true locations), and he also specified the starting pose of the camera. Hence, both the map and the pose series start from known data, which helps the system have a sense of depth at the starting point.

Disadvantages

1. If we want to run MonoSLAM in real time (at 30 Hz, for example), we cannot have more than about 100 features in the map, which is enough for pose estimation inside a small room but not in bigger indoor or outdoor environments.

2. The other setback of this approach is its fragility in the presence of moving objects. The key point of the algorithm is repeatable observation, and if features move inside the scene, re-observing them can result in closing incorrect loops and unreasonable localization. Moreover, as a result of this problem, it is almost


Introduction Davison’s MonoSLAM The SceneLib Libraries Final Thoughts

The MonoSLAMGlow Application

Failed matchSuccessful match

Unused feature

Figure 2.7: GUI view of MonoSLAM software.

impossible to use the robot to track a moving object inside the scene.

We study possible ways to overcome these disadvantages in Chapter 7, and believe

this approach is currently the best one in the literature for our purpose.


Chapter 3

Essential Preliminaries

In this chapter we review the fundamental concepts and techniques necessary for understanding the MonoSLAM algorithm. We cover several probability estimation techniques, namely Markov sequences, Monte Carlo simulation, Bayes' theorem and Dynamic Bayesian Networks (DBN), as well as the fundamental foundations of spatial relationships.

3.1 Markov Sequence

Papoulis [27] defined a Markov process as a random process whose upcoming probabilities depend only on its most recent value. In other words, a stochastic process x(t) is Markovian if it has the following property for all n and t_1 < t_2 < ... < t_n:

P(x(t_n) \le x_n \mid x(t_{n-1}), \ldots, x(t_1)) = P(x(t_n) \le x_n \mid x(t_{n-1}))

This is equivalent to:

P(x(t_n) \le x_n \mid x(t), \forall t \le t_{n-1}) = P(x(t_n) \le x_n \mid x(t_{n-1}))

([27], p. 535).

These equations signify the fact that the current state value has all the information


we need, and the future value of such a variable is independent of its past. A process with this property is therefore very computationally efficient to work with.

Markov sequences also have the following property:

p(x(t+1)) = \int p(x(t+1) \mid x(t)) \, p(x(t)) \, dx(t),

which is used in the SLAM prediction step, as explained in Chapter 4.
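As an illustration of this prediction property, the following sketch (not from the thesis; the transition matrix and belief vector are made-up values) propagates a belief over a three-state discrete Markov chain one step forward, where the integral above becomes a sum:

    import numpy as np

    # Transition model: P[i, j] = p(x(t+1) = j | x(t) = i).
    P = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
    p_t = np.array([1.0, 0.0, 0.0])   # current belief p(x(t))

    # Prediction: p(x(t+1)) = sum over x(t) of p(x(t+1)|x(t)) p(x(t)).
    p_next = p_t @ P
    print(p_next)                     # [0.9 0.1 0. ]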

3.2 Conditional Mean and Variance in Gaussian Distribution

For jointly Gaussian random vectors y and z, the conditional distribution of y given z is also Gaussian, with the following moments [34].

Conditional mean:

E[y \mid z] = E[y] + C_{yz} C_z^{-1} (z - E[z])

Conditional variance:

C_{yy|z} = C_{yy} - C_{yz} C_z^{-1} C_{zy},

where

C_{yz} = E[(y - E[y])(z - E[z])^T]
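The following is a minimal numerical sketch (not from the thesis; all block matrices and the observed value are invented for illustration) of applying these conditioning formulas with NumPy:

    import numpy as np

    # Blocks of the joint covariance of the stacked vector [y; z].
    C_yy = np.array([[4.0, 1.0], [1.0, 3.0]])
    C_yz = np.array([[0.8, 0.2], [0.1, 0.5]])
    C_zz = np.array([[2.0, 0.3], [0.3, 1.5]])
    mu_y = np.array([1.0, 0.0])
    mu_z = np.array([0.0, 2.0])
    z_obs = np.array([0.5, 1.0])                    # measured value of z

    gain = C_yz @ np.linalg.inv(C_zz)               # C_yz C_z^{-1}
    mean_y_given_z = mu_y + gain @ (z_obs - mu_z)   # conditional mean
    cov_y_given_z = C_yy - gain @ C_yz.T            # conditional covariance
    print(mean_y_given_z)
    print(cov_y_given_z)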

3.3 EKF

Unlike the regular Kalman filter, which estimates the state of a controlled linear process, the EKF (Extended Kalman Filter) addresses the problem of estimating the state of a nonlinear process, where the state transition and/or observation models are nonlinear functions. The EKF linearizes these two nonlinear functions around the current mean and covariance, using a truncated Taylor series expansion [17].


Assume the process to be estimated is

x_k = f(x_{k-1}, u_k, w_{k-1}),

and the measurement is

z_k = h(x_k, v_k),

where w_k and v_k represent zero-mean Gaussian process and measurement noise with given covariances, x_k and z_k represent the state and the observation at time k respectively, f is the process model and h is the measurement model.

The EKF algorithm consists of two parts, a time update and a measurement update. The complete set of equations is given below.

EKF time update (prediction) equations:

(1) Project the state ahead:

\hat{x}_k^- = f(\hat{x}_{k-1}, u_k, 0)

(2) Project the error covariance ahead:

P_k^- = A_k P_{k-1} A_k^T + W_k Q_{k-1} W_k^T

where A_k and W_k are the process Jacobians at step k, and Q_k is the process noise covariance at step k.

EKF measurement update (correction) equations:

(1) Compute the Kalman gain:

K_k = P_k^- H_k^T (H_k P_k^- H_k^T + V_k R_k V_k^T)^{-1}

(2) Update the estimate with measurement z_k:

\hat{x}_k = \hat{x}_k^- + K_k (z_k - h(\hat{x}_k^-, 0))


(3) Update the error covariance P_k:

P_k = (I - K_k H_k) P_k^-

where H_k and V_k are the measurement Jacobians at step k, and R_k is the measurement noise covariance at step k.
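The following sketch (not the MonoSLAM implementation; the model functions f, h and the Jacobian functions A, W, H, V are placeholders supplied by the caller) shows one EKF predict/correct cycle written directly from the equations above:

    import numpy as np

    def ekf_step(x, P, u, z, f, h, A, W, H, V, Q, R):
        """One predict/correct cycle of the Extended Kalman Filter."""
        # Time update (prediction).
        x_pred = f(x, u)                              # project the state ahead
        A_k, W_k = A(x, u), W(x, u)                   # process Jacobians
        P_pred = A_k @ P @ A_k.T + W_k @ Q @ W_k.T    # project the covariance

        # Measurement update (correction).
        H_k, V_k = H(x_pred), V(x_pred)               # measurement Jacobians
        S = H_k @ P_pred @ H_k.T + V_k @ R @ V_k.T
        K = P_pred @ H_k.T @ np.linalg.inv(S)         # Kalman gain
        x_new = x_pred + K @ (z - h(x_pred))          # correct with the innovation
        P_new = (np.eye(len(x)) - K @ H_k) @ P_pred   # update the covariance
        return x_new, P_new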

3.4 Monte Carlo Simulation

The Monte Carlo method was invented by S. Ulam and N. Metropolis in 1949 [38] for solving problems using random numbers and probability distributions. This technique is usually used to simulate physical and mathematical systems whose models are nonlinear, have many uncertain parameters, or are too complicated to solve analytically [2].

A Monte Carlo simulation iteratively samples from the probability distributions defined for the uncertain input variables and uses these samples to feed the system and evaluate the result.

In order to do this, we must find probability distributions that match our input data as closely as possible and express our knowledge about these data precisely.

A Monte Carlo simulation has the following steps:

Step 1: model the system.

Step 2: find the input probability distribution and generate data from it.

Step 3: run the system with this data set and check the result.

Step 4: iterate steps 1 to 3 n times.

Step 5: analyze the results.


Monte Carlo sampling gives progressively better estimates of the output as the input sampling continues.
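As a concrete illustration, the sketch below (not from the thesis; the input distributions and the system function g are invented) runs these steps for a simple nonlinear system:

    import numpy as np

    rng = np.random.default_rng(0)

    def g(x):
        # Placeholder nonlinear system model.
        return np.sin(x[:, 0]) + x[:, 1] ** 2

    n = 100_000
    # Sample the uncertain inputs from their assumed distributions.
    x = np.column_stack([rng.normal(0.0, 0.3, n),     # input 1 ~ N(0, 0.3^2)
                         rng.uniform(0.5, 1.5, n)])   # input 2 ~ U(0.5, 1.5)
    # Run the system on every sample and analyze the output distribution.
    y = g(x)
    print("mean:", y.mean(), "std:", y.std())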

3.5 Bayes Theorem

In probability theory, Bayes' theorem simplifies the mathematical formula used for calculating conditional probabilities; it gives a simple connection between the conditional and marginal probabilities of two random variables or events [30]:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)},

where:

P(A|B): the posterior, the probability of the state of nature being A given that feature value B has been measured;

P(B|A): the likelihood of B with respect to A;

P(A): the prior, the probability of the state of nature being A before any observation.

Bayes' formula is often used to compute posterior probabilities from observations and priors as

posterior = (likelihood × prior) / normalizing constant.
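A minimal worked example of this formula (with made-up numbers) for a binary hypothesis A and an observation B:

    p_A = 0.2                     # prior P(A)
    p_B_given_A = 0.9             # likelihood P(B|A)
    p_B_given_notA = 0.1          # likelihood P(B|not A)

    # Normalizing constant P(B), by total probability.
    p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

    # Posterior = likelihood * prior / normalizing constant.
    p_A_given_B = p_B_given_A * p_A / p_B
    print(p_A_given_B)            # about 0.692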

3.5.1 DBN

A Dynamic Bayesian Network [24] is a specific type of BN built for stochastic processes; it represents sequences of variables (usually time series) as a directed graphical model of this kind of process. The Hidden Markov Model (HMM) is a simple example of a DBN [20].

A BN (Bayesian or belief network) is a simple tool for representing and reasoning about uncertainty, and belongs to the graphical model (GM) family. It visualizes how different variables depend on each other, represents what we know about an uncertain domain, and helps to simplify complex problems. A Bayesian network B is an



Figure 3.1: DBN for the SLAM problem. U_i: control input applied at time (i-1) to drive the system to state X(i); Z(i,j): observation of L_j made at time i.

acyclic graph of the form B = ⟨G, Θ⟩ which corresponds to the joint probability distribution over a set of random variables, where G is a DAG (Directed Acyclic Graph) with nodes X_1, ..., X_n and Θ represents the network parameters θ_{X_i|π_i} = P_B(X_i | π_i), with π_i denoting the parents of node X_i. B defines the unique joint probability distribution given below:

P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid \pi_i) = \prod_{i=1}^{n} \theta_{X_i \mid \pi_i}


A Bayesian network can be built in the following three steps:

Step 1: add the variables we want to include in the network as nodes X_i.

Step 2: add the links from each parent (π_i) to its children. Each link represents a relationship between two nodes; note that the graph must remain acyclic.

Step 3: for each node X, define a probability table containing P(X|π) for all combinations of its parents' values.

Figure 3.1 illustrates the DBN of the SLAM problem.
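As a small illustration of the factorization above, the sketch below (not from the thesis; the chain X1 -> X2 -> X3 and its probability tables are invented) evaluates the joint distribution as a product of the local conditional tables:

    # Conditional probability tables for the chain X1 -> X2 -> X3 (binary).
    p_x1 = {0: 0.7, 1: 0.3}                      # P(X1)
    p_x2 = {0: {0: 0.9, 1: 0.1},                 # P(X2 | X1)
            1: {0: 0.4, 1: 0.6}}
    p_x3 = {0: {0: 0.8, 1: 0.2},                 # P(X3 | X2)
            1: {0: 0.3, 1: 0.7}}

    def joint(x1, x2, x3):
        """P(X1, X2, X3) as the product of the local conditional tables."""
        return p_x1[x1] * p_x2[x1][x2] * p_x3[x2][x3]

    # The joint sums to one over all assignments, as required.
    total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
    print(joint(1, 1, 0), total)                 # 0.054  1.0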

3.6 Shi and Tomasi

Shi and Tomasi introduced a feature tracking algorithm that obtains the features of an image at time t + τ based on the features of the image at time t [31]. The method moves every point in the earlier image by an appropriate displacement vector δ = Dx + d, where d is the feature's translation vector and D is the deformation matrix

D = \begin{bmatrix} d_{xx} & d_{xy} \\ d_{yx} & d_{yy} \end{bmatrix}

Denote the current image by I(x, t) and the upcoming image by I(Ax + d, t + τ), where A = I_{2\times 2} + D. Shi and Tomasi's idea was to find the affine transformation that minimizes the dissimilarity ε between these two images:

\varepsilon = \iint_W [I(Ax + d) - I(x)]^2 \, dx,

where W is a window around the feature in the current image over which this residual is computed. In the simplest case the deformation D is neglected (so A reduces to the identity matrix) and the


only important quantity is the translation vector d. By setting the derivative of ε with respect to d equal to zero and linearizing the system with a truncated Taylor expansion, the following linear system is obtained:

Gd = e

G = \sum_{W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \qquad e = \sum_{W} I_t \begin{bmatrix} I_x \\ I_y \end{bmatrix}

The method finds good features to track by computing the eigenvalues λ_1 and λ_2 of the matrix G and checking them against a predetermined threshold λ, that is, accepting a feature if min(λ_1, λ_2) > λ [35].
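A minimal sketch of this selection criterion (not the MonoSLAM/SceneLib code; the window size and threshold are arbitrary assumptions) that scores a single image location by the smallest eigenvalue of G:

    import numpy as np

    def shi_tomasi_score(image, x, y, half_win=7):
        """Smallest eigenvalue of G for a window centered at pixel (x, y)."""
        Iy, Ix = np.gradient(image.astype(float))    # image gradients
        win = (slice(y - half_win, y + half_win + 1),
               slice(x - half_win, x + half_win + 1))
        Ix_w, Iy_w = Ix[win], Iy[win]
        # G = sum over W of [[Ix^2, Ix Iy], [Ix Iy, Iy^2]].
        G = np.array([[np.sum(Ix_w * Ix_w), np.sum(Ix_w * Iy_w)],
                      [np.sum(Ix_w * Iy_w), np.sum(Iy_w * Iy_w)]])
        return np.linalg.eigvalsh(G).min()

    # A location is a good feature to track if its score exceeds a threshold:
    # good = shi_tomasi_score(img, 120, 80) > 1e4    # img: grayscale array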

3.7 Spatial Relationships

3.7.1 AT

Smith [29] defines an AT (uncertain or Approximate Transformation) as an estimate of the relationship between two different coordinate frames, accompanied by a covariance matrix that shows how much uncertainty this estimate has. Each AT represents an observation (or another kind of sensing) or a relative motion. Smith used AT(A) = [x, y, θ] (where θ is the rotation about the z-axis) to express the relative location of landmark A with respect to the world reference frame, and AT(AB) to express the relative location of landmark A's coordinate frame with respect to landmark B's.

In Figure 3.2 (from [29]), Smith and Cheeseman used an arrow (from the reference frame to the landmark) to represent the AT and an ellipse (representing the two-dimensional Gaussian distribution) around the landmark to show its uncertainty.



Figure 3.2: A sequence of approximate transformations. Line: AT; ellipse: AT's uncertainty; solid: gained by movement; dashed: gained by observation.

3.7.2 Coordinate Frame Relationships

Having different ATs in different coordinate frames, we need to estimate the relative AT between these frames. To do so, we use operations that combine the information from the intermediate steps and produce a single AT ([29], [28]).

We can write any such relationship as

y = g(x)


where g() is a general function consisting of a desired combination of the Compounding, Merging and Reversal operations. The estimate of the first two moments of this function, i.e. the mean and the covariance matrix, is given by:

\hat{y} \approx g(\hat{x}), \qquad C(y) \approx G_x C(x) G_x^T, \qquad \text{where } G_x \equiv \left.\frac{\partial g(x)}{\partial x}\right|_{\hat{x}}.

Compounding: ⊕

The compounding operation allows us to recursively collapse a chain of subsequent ATs into a single AT. Each application of this operation (combining two ATs) produces a resulting AT whose uncertainty is larger than that of either of the two ATs it is made from. This is similar to moving with closed eyes in an unknown environment: with each step you take, the uncertainty about where you are with respect to where you started increases.

Compounding formulas for a pair are given by:

x_{ik} = x_{ij} \oplus x_{jk} =
\begin{bmatrix}
x_{jk}\cos\theta_{ij} - y_{jk}\sin\theta_{ij} + x_{ij} \\
x_{jk}\sin\theta_{ij} + y_{jk}\cos\theta_{ij} + y_{ij} \\
\theta_{ij} + \theta_{jk}
\end{bmatrix},
\qquad
x_m = \begin{bmatrix} x_m \\ y_m \\ \theta_m \end{bmatrix},

The compounded mean is:

\hat{x}_{ik} \approx \hat{x}_{ij} \oplus \hat{x}_{jk},

and the compounded covariance matrix is:

C(x_{ik}) = J_\oplus \begin{bmatrix} C(x_{ij}) & C(x_{ij}, x_{jk}) \\ C(x_{jk}, x_{ij}) & C(x_{jk}) \end{bmatrix} J_\oplus^T,


where the Jacobian of the compounding operation is:

J_\oplus = \frac{\partial (x_{ij} \oplus x_{jk})}{\partial (x_{ij}, x_{jk})} = \frac{\partial x_{ik}}{\partial (x_{ij}, x_{jk})} =
\begin{bmatrix}
1 & 0 & -(y_{ik} - y_{ij}) & \cos\theta_{ij} & -\sin\theta_{ij} & 0 \\
0 & 1 & (x_{ik} - x_{ij}) & \sin\theta_{ij} & \cos\theta_{ij} & 0 \\
0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix}
= \begin{bmatrix} J_{1\oplus} & J_{2\oplus} \end{bmatrix}
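To make these formulas concrete, the following is a minimal C++ sketch (not the implementation used in this work) that compounds two planar ATs and propagates their covariances through the Jacobian blocks J1⊕ and J2⊕, under the assumption that the two ATs have independent errors so that the cross-covariance blocks vanish.

#include <cmath>

struct AT {                  // planar approximate transformation: mean and 3x3 covariance
    double x, y, theta;
    double C[3][3];
};

// c = a * b (3x3 matrix product), helper for the covariance propagation below.
static void mul3(const double a[3][3], const double b[3][3], double c[3][3]) {
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            c[i][j] = 0.0;
            for (int k = 0; k < 3; ++k) c[i][j] += a[i][k] * b[k][j];
        }
}

// Compound x_ik = x_ij (+) x_jk and propagate the covariance as
// C(x_ik) ~= J1 C(x_ij) J1^T + J2 C(x_jk) J2^T  (independent errors assumed).
AT compound(const AT& ij, const AT& jk) {
    double c = std::cos(ij.theta), s = std::sin(ij.theta);
    AT ik;
    ik.x     = jk.x * c - jk.y * s + ij.x;
    ik.y     = jk.x * s + jk.y * c + ij.y;
    ik.theta = ij.theta + jk.theta;

    // Jacobian blocks of the compounding operation.
    double J1[3][3] = {{1, 0, -(ik.y - ij.y)}, {0, 1, (ik.x - ij.x)}, {0, 0, 1}};
    double J2[3][3] = {{c, -s, 0}, {s, c, 0}, {0, 0, 1}};
    double J1T[3][3], J2T[3][3], tmp[3][3], A[3][3], B[3][3];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) { J1T[i][j] = J1[j][i]; J2T[i][j] = J2[j][i]; }

    mul3(J1, ij.C, tmp);  mul3(tmp, J1T, A);   // J1 C(x_ij) J1^T
    mul3(J2, jk.C, tmp);  mul3(tmp, J2T, B);   // J2 C(x_jk) J2^T
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) ik.C[i][j] = A[i][j] + B[i][j];
    return ik;
}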

Merging: ⊗

The merging operation is used to combine parallel ATs between the same pair of coordinate frames. These parallel ATs occur because we have more than one source of data: the sensors which collect observations, and also the UAV's odometry after each movement. So the sensor result and the compounded estimate (computed with the ⊕ operation) give us two different pieces of information about the same landmark, and combining the two decreases the uncertainty.

Merging a pair of transformations X1 and X2 is denoted by:

X_3 = X_1 \otimes X_2.

To find the merged mean and covariance matrix, the Kalman gain factor is computed as:

K = C_1 \, [C_1 + C_2]^{-1},

using this gain, the merged covariance matrix is:

C_3 = C_1 - K C_1,

and the merged mean is:

X_3 = X_1 + K (X_2 - X_1),

where C1, C2 and X1, X2 are the covariance matrices and mean vectors of the two ATs we want to merge, respectively, and C3 and X3 are the covariance matrix and mean vector of the


resulting merged pair.
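A minimal one-dimensional C++ sketch of the merge is given below; the full operation is identical with matrices and vectors in place of scalars, and the numerical values are purely illustrative.

#include <cstdio>

// Merge two 1-D estimates (x1, c1) and (x2, c2) of the same quantity using the
// Kalman-gain form: K = c1/(c1+c2), c3 = c1 - K*c1, x3 = x1 + K*(x2 - x1).
int main() {
    double x1 = 2.0, c1 = 4.0;   // first AT: mean and variance (hypothetical values)
    double x2 = 3.0, c2 = 1.0;   // second AT: mean and variance (hypothetical values)

    double K  = c1 / (c1 + c2);
    double c3 = c1 - K * c1;     // merged variance: never larger than min(c1, c2)
    double x3 = x1 + K * (x2 - x1);

    std::printf("merged mean = %f, merged variance = %f\n", x3, c3);
    return 0;
}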

Reversal:

For merging and compounding, all ATs must point in the right direction. If an AT X_ij = (x_ij, y_ij, θ_ij) is in the wrong direction, the reversal operation can be used to correct its direction (sense).

Reversal formulas:

X_{ji} = \ominus X_{ij} =
\begin{bmatrix}
- x_{ij}\cos\theta_{ij} - y_{ij}\sin\theta_{ij} \\
x_{ij}\sin\theta_{ij} - y_{ij}\cos\theta_{ij} \\
-\theta_{ij}
\end{bmatrix}

The reversal mean is:

\hat{x}_{ji} \approx \ominus \hat{x}_{ij}

and the reversal covariance matrix is:

C(X_{ji}) = J_\ominus C(X_{ij}) J_\ominus^T

where the Jacobian of the reversal operation is:

J_\ominus \equiv \frac{\partial X_{ji}}{\partial X_{ij}} =
\begin{bmatrix}
-\cos\theta_{ij} & -\sin\theta_{ij} & y_{ji} \\
\sin\theta_{ij} & -\cos\theta_{ij} & -x_{ji} \\
0 & 0 & -1
\end{bmatrix}.
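The corresponding C++ sketch of the reversal (again illustrative only, not the thesis's code) is:

#include <cmath>

struct Pose2D { double x, y, theta; };

// Reverse the sense of a planar AT: X_ji = (-)X_ij.
Pose2D reverse(const Pose2D& p) {
    double c = std::cos(p.theta), s = std::sin(p.theta);
    Pose2D r;
    r.x     = -p.x * c - p.y * s;
    r.y     =  p.x * s - p.y * c;
    r.theta = -p.theta;
    return r;
}

The covariance would then be propagated through the Jacobian above as C(X_ji) = J_⊖ C(X_ij) J_⊖^T.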

Figure 3.3 shows an example of the above three operations.

3.7.3 Stochastic Map

The world is represented as a 3D surface made up of different patches. Each patch corresponds to a 3D object with photometric and geometric (shape) properties and a coordinate frame which defines the patch's pose in the world. Smelyanskiy et al. showed in [39] that the properties of each patch can be completely learned using the


Figure 3.3: Compounding, Merging and Reversal operations. Line: AT; ellipse: AT's uncertainty; solid: gained by movement; dashed: gained by observation; Xij: AT from frame i to frame j.

Bayesian Super Resolution technique [32]. Learning these properties is especially useful when examining an observed landmark to find out whether it has been seen before or is a new one. To know the environment completely, the UAV needs to know each patch's pose as well. The most popular tool for this purpose is the EKF, where the world is represented as a set of features and each feature is the pose of a single patch as described above. The EKF tracks a mean state vector which consists of the UAV pose followed by the pose of each feature [33]. Figure 3.4 illustrates a simple example of the world map.


Figure 3.4: Map of the 3D world; each feature is represented as a 6-DOF vector (3-DOF position and 3-DOF orientation).

Because we do not know exactly where the landmarks and the UAV are, we must represent the objects in a stochastic map which includes the uncertain spatial relationships, their uncertainties, and a measure of the interdependencies between these uncertainties. An uncertain spatial relationship can be represented as a probability distribution over its spatial variable; because this distribution is unknown, we model it by estimating only its first two moments (the mean \hat{x} and the covariance matrix C(x)) and use the corresponding normal distribution to approximate the real one.

For the 2D case, and assuming no sensor is provided, the map is built with the following two-step algorithm:

step 1: Adding the first relation to the empty map:


X = X_1 = \begin{bmatrix} x \\ y \\ \phi \end{bmatrix}, \qquad
C(X) = \begin{bmatrix}
\sigma_x^2 & \sigma_{xy} & \sigma_{x\phi} \\
\sigma_{xy} & \sigma_y^2 & \sigma_{y\phi} \\
\sigma_{x\phi} & \sigma_{y\phi} & \sigma_\phi^2
\end{bmatrix}

step 2: Adding a new object, one at a time:

step 2: Adding a new object, one at a time:

X' = \begin{bmatrix} X \\ X_n \end{bmatrix}, \qquad
C(X') = \begin{bmatrix}
C(X) & C(X, X_n) \\
C(X_n, X) & C(X_n)
\end{bmatrix}
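As a structural sketch of step 2, and assuming the Eigen linear-algebra library (which is not used in this thesis), the state vector and covariance can be grown by one 3-DOF object as follows; the new object's covariance C(X_n) and its cross-covariance C(X, X_n) with the existing map are taken here as given inputs.

#include <Eigen/Dense>

// Append a new 3-DOF object (xn, Cn) to the stochastic map (X, C), together
// with the cross-covariance Cxn between the old map and the new object.
// This only shows the block bookkeeping of step 2.
void addObject(Eigen::VectorXd& X, Eigen::MatrixXd& C,
               const Eigen::Vector3d& xn,
               const Eigen::Matrix3d& Cn,
               const Eigen::MatrixXd& Cxn)   // n x 3 cross-covariance
{
    const int n = static_cast<int>(X.size());

    Eigen::VectorXd Xp(n + 3);
    Xp.head(n) = X;
    Xp.tail(3) = xn;

    Eigen::MatrixXd Cp = Eigen::MatrixXd::Zero(n + 3, n + 3);
    Cp.topLeftCorner(n, n)     = C;
    Cp.topRightCorner(n, 3)    = Cxn;
    Cp.bottomLeftCorner(3, n)  = Cxn.transpose();
    Cp.bottomRightCorner(3, 3) = Cn;

    X = Xp;
    C = Cp;
}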

3.7.4 Updating The Map

The map can change if one of the following situations happens:

1. The world itself changes (note: the UAV is part of the world, so if it moves the world changes; because we assumed the landmarks are stationary in our research, the world can change only through the UAV's movement).

2. Our knowledge about the world changes, for example by observing a landmark again and more precisely (the loop closure problem).

We have estimated the world map by its first two moments, so in order to change

Figure 3.5: Updating the map.


the map we must update these two moments. Figure 3.5 (from [28]) shows the change in the map due to a change of the world and due to the addition of new observations (measurements).

Smith et al. assumed in [28] that new observations are applied at discrete time k and that their effect is immediate. They denote the moments just before the observation by \hat{x}_k^{(-)} and C(x_k^{(-)}), and those right after it by \hat{x}_k^{(+)} and C(x_k^{(+)}). Making a new observation increases our knowledge about the world, and this decreases the uncertainty of the observed landmark and, through the correlations, of the rest of the world (a direct result of the merging operation).

To find \hat{x}_k^{(+)} and C(x_k^{(+)}) mathematically, we must model our sensor (a camera in our case) in a way that describes how it maps the spatial variables (x_i: the pose of each landmark) into the sensor variables (z_i: the observed 2D image patches). The camera model is defined as a function h, and because the measurement is noisy we add to it an independent zero-mean Gaussian noise v with a given covariance:

z = h(x) + v

The estimated map after this observation, using the formula in Section 3.2, is:

\hat{x}|z = \hat{x} + C(x, z)\, C(z)^{-1} (z - \hat{z})

C(x|z) = C(x) - C(x, z)\, C(z)^{-1} C(x, z)^T

where \hat{x}_k^{(+)} = \hat{x}|z and \hat{x}_k^{(-)} = \hat{x}.

If the above covariance matrices C(x) and C(x, z) are replaced by their approximations, we obtain the Kalman filter equations given below:

\hat{x}_k^{+} = \hat{x}_k^{-} + K_k \left[ z_k - h_k(\hat{x}_k^{-}) \right],

C(x_k^{+}) = C(x_k^{-}) - K_k H_x C(x_k^{-}),

K_k = C(x_k^{-}) H_x^T \left[ H_x C(x_k^{-}) H_x^T + C(v)_k \right]^{-1}.


Between two sensor updates (i.e. between times k − 1 and k), the UAV can move. Assume the UAV is the R-th relationship in the map, so its current estimated location is x_R, its future location (after the movement) is x'_R, and the relative motion it makes to get there is y_R; note that this relative motion is uncertain. Using the compounding formulas given in Section 3.7.2, and assuming x_R and y_R have independent errors, the new state vector and covariance matrix can be inferred as follows:

x'_R = x_R \oplus y_R, \qquad \hat{x}'_R \approx \hat{x}_R \oplus \hat{y}_R,

C(x'_R) \approx J_{1\oplus} C(x_R) J_{1\oplus}^T + J_{2\oplus} C(y_R) J_{2\oplus}^T,

C(x'_R, x_i) \approx J_{1\oplus} C(x_R, x_i)

and the new map becomes:

x' = \begin{bmatrix} x \\ x'_R \end{bmatrix}, \qquad
C(x') = \begin{bmatrix}
C(x) & C(x, x'_R) \\
C(x'_R, x) & C(x'_R)
\end{bmatrix}


Chapter 4

SLAM

4.1 Abstract

Simultaneous localization and mapping (SLAM) is the problem of being at an unknown location in an unknown environment and simultaneously learning the environment (mapping) and determining the location accurately and repeatedly (localization) using this map. So, in order to solve the SLAM problem, we must answer the following questions:

1. Where am I?

Localization is the act of providing the map to the UAV, where the UAV is required to find its location with respect to the objects in the given map.

2. What does the world around me look like?

Mapping, on the other hand, is when the UAV location is known and we ask it to find the locations of all the objects in the environment surrounding it (create a sparse map of the room) with respect to its own location.

The main problem here is that, to map a feature into the world map, the UAV needs to know its own location; and to find its own location, the UAV needs an accurate map of the


Figure 4.1: 1: start the prediction step; 2: estimate the new location; 3: add process noise; 4: start the measurement step and measure the features; 5: update the positions and uncertainties.

world! And this is a bootstrapping problem.

So, SLAM is the act of the UAV building a map of the environment around it and, at the same time, using this map to find its own location.

4.2 Implementation

SLAM is now considered a solved problem ([14] and [8]): it can estimate the trajectory of the sensor (which is mounted on the robot, and hence has the same


Table 4.1: SLAM Notations

x_k: state vector of the UAV, including its location and orientation (pose) at time k.
X_{0:k} = {X_{0:k-1}, x_k}: history of UAV poses, up to and including time k.
u_k: the control input (applied at time k-1) which drove the UAV to the state x_k (at time k).
U_{0:k} = {U_{0:k-1}, u_k}: history of control inputs, up to and including time k.
z_{ik}: observation (sensor input) of the i-th landmark at time k.
z_k = {z_{1k}, ...}: observations of all landmarks at time k.
Z_{0:k} = {Z_{0:k-1}, z_k}: history of observations, up to and including time k.
m_i: estimated location of the i-th landmark; all landmarks are assumed stationary (their true locations do not change with time).
m = {m_1, m_2, ..., m_n}: set of all landmarks (note that the map parameters do not carry a time subscript, as they are modelled as stationary).

trajectory as the robot) and locations of all the landmarks in the environment without

any prior information.

4.2.1 Notations

In order to solve the SLAM problem, Bailey and Durrant-Whyte [14] introduced the notation shown in Table 4.1.

4.2.2 Observation Model

P (zk|xk,m)

The observation model describes the effect of an observation. It computes the probability of making the k-th observation of the environment (z_k) given the location of the UAV (x_k) and the map of the room (m), and describes how the uncertainty can be reduced. If we precisely knew the true map, new observations would not change it, so in


that case making observations would be independent of the map of the world.

4.2.3 Motion Model

P (xk|xk−1, uk)

The motion model describes the effect of the control input on the state transition and captures how the uncertainty increases. The motion model is a function of the previous state (x_{k-1}) and the control input which drove the system to the current state (u_k). Since the probability of x_k depends only on x_{k-1} and u_k, this model is Markov and we can use its properties (as mentioned in Section 3.1) later in the SLAM algorithm.

4.3 Probabilistic SLAM

Durrant-Whyte and Bailey defined the SLAM problem in [14] as the computation of the joint posterior probability P(x_k, m | Z_{0:k}, U_{0:k}, x_0) (or simply P(x_k, m)) of the UAV state x_k and the map m, given all the observations and control inputs up to time k.

To do so, they introduced the following two-step recursive algorithm, in which the first step is the time update and the second is the measurement update.

4.3.1 Step one: Prediction (time update)

When the UAV moves, we need to estimate its new position and the uncertainty of its location (knowing that this uncertainty is monotonically increasing). The motion model is used in this step as follows:

P (xk,m|Z0:k−1, U0:k, x0) =∫P (xk|xk−1, uk)× P (xk−1,m|Z0:k−1, U0:k−1, x0)dxk−1

4.3.2 Step two: Correction (measurement update)

The aim of this step is to add new features to the map and to remeasure the previously added features. When the UAV observes a feature which is already in the map, it updates the system using Bayes' theorem and the Markov property as follows:

P(x_k, m \mid Z_{0:k}, U_{0:k}, x_0) = \frac{P(z_k \mid x_k, m)\, P(x_k, m \mid Z_{0:k-1}, U_{0:k}, x_0)}{P(z_k \mid Z_{0:k-1}, U_{0:k-1})}


This formula is a function of the vehicle model P(x_k | x_{k-1}, u_k) and the observation model P(z_k | x_k, m).

When the UAV observes a new feature (one which is not yet in the map), its initial position is estimated using the inverse of the observation equation (P(m_i | x_k, m)).

4.4 EKF-SLAM

EKF-SLAM is the most popular solution to the SLAM problem. It uses a linearized Gaussian model to approximate the posterior as a Gaussian distribution, where the mean (µ_k) contains the state vector (the location of the robot and the map of the environment), and the covariance matrix (Σ_k) estimates the uncertainties and shows how the elements of the state vector are correlated with each other.

The a posteriori state estimate represents the mean of the state distribution:

\mu_k = \begin{bmatrix} \hat{x}_{k|k} \\ \hat{m}_k \end{bmatrix} = E\!\left[ \begin{bmatrix} x_k \\ m \end{bmatrix} \,\middle|\, Z_{0:k} \right],

and the a posteriori estimate error covariance represents the variance of the state distribution:

\Sigma_k = P_{k|k} = \begin{bmatrix} P_{xx} & P_{xm} \\ P_{xm}^T & P_{mm} \end{bmatrix}
= E\!\left[ \begin{bmatrix} x_k - \hat{x}_k \\ m - \hat{m}_k \end{bmatrix} \begin{bmatrix} x_k - \hat{x}_k \\ m - \hat{m}_k \end{bmatrix}^T \,\middle|\, Z_{0:k} \right]

Hence, the joint probability distribution used above can be written as the following normal distribution, given that the additive process and measurement noises are white:

P (xk,m|Z0:k, U0:k, x0)⇔ N(µk,Σk)

Defining the posterior as this normal distribution, we can apply EKF to compute the

mean and covariance matrix of the joint posterior distribution P (xk,m|Z0:k, U0:k, x0).


Table 4.2: EKF-SLAM Notations

f: motion model
h: measurement model
w_k ∼ N(0, Q_k): additive, zero-mean, uncorrelated Gaussian motion disturbances with covariance Q_k
v_k ∼ N(0, R_k): additive, zero-mean, uncorrelated Gaussian observation errors with covariance R_k

Motion model

The motion model describes the change of the robot's state with time as:

P (xk|xk−1, uk)⇔ xk = f(xk−1, uk) + wk

Observation model

The observation model predicts the measurement value given the robot's state as:

P (zk|xk,m)⇔ zk = h(xk,m) + vk

where the notation is explained in Table 4.2.

4.4.1 Time Update (Predict)

The purpose of this step is to update the state vector, as the landmarks or the robot itself may move during the learning process. There is no need to compute the time update if the landmarks and the robot are stationary and do not move. In our case all the landmarks are stationary and only the robot can move; when it moves, this step gives its new position and the uncertainty of its location, noting that positional uncertainty always increases. The time update is performed via the following two steps:

step 1: project the state ahead:

\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_k)


step 2: project the error covariance ahead:

P_{xx,k|k-1} = \nabla f \, P_{xx,k-1|k-1} \, \nabla f^T + Q_k

4.4.2 Observation Update (Correct)

This step aims to add new features to the map and to remeasure the existing features by observing them again:

step 1: update the estimate with the observation z_k:

\begin{bmatrix} \hat{x}_{k|k} \\ \hat{m}_k \end{bmatrix} = \begin{bmatrix} \hat{x}_{k|k-1} \\ \hat{m}_{k-1} \end{bmatrix} + W_k \left[ z_k - h(\hat{x}_{k|k-1}, \hat{m}_{k-1}) \right]

step 2: update the error covariance:

P_{k|k} = P_{k|k-1} - W_k S_k W_k^T,

where the innovation covariance S_k and the Kalman gain W_k are:

S_k = \nabla h \, P_{k|k-1} \nabla h^T + R_k

W_k = P_{k|k-1} \nabla h^T S_k^{-1},

and ∇h is the Jacobian of h evaluated at \hat{x}_{k|k-1} and \hat{m}_{k-1}.
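A minimal C++ sketch of these two steps is given below, assuming the Eigen linear-algebra library (not used in this thesis); the models f and h and their Jacobians F and H are passed in as hypothetical callbacks and matrices, since their exact form depends on the vehicle and the sensor.

#include <Eigen/Dense>
#include <functional>

using Vec = Eigen::VectorXd;
using Mat = Eigen::MatrixXd;

// Time update (predict): project the state and the covariance ahead.
// In EKF-SLAM only the vehicle block of the state actually changes; for
// brevity this sketch propagates the full state with a full Jacobian F
// (identity on the map part) and a full process-noise matrix Q.
void ekfPredict(Vec& x, Mat& P,
                const std::function<Vec(const Vec&)>& f,
                const Mat& F,     // Jacobian of f at the current estimate
                const Mat& Q)     // process noise covariance
{
    x = f(x);
    P = F * P * F.transpose() + Q;
}

// Observation update (correct): fuse the measurement z with the prediction.
void ekfUpdate(Vec& x, Mat& P, const Vec& z,
               const std::function<Vec(const Vec&)>& h,
               const Mat& H,     // Jacobian of h at the predicted estimate
               const Mat& R)     // observation noise covariance
{
    Mat S = H * P * H.transpose() + R;          // innovation covariance S_k
    Mat W = P * H.transpose() * S.inverse();    // Kalman gain W_k
    x = x + W * (z - h(x));
    P = P - W * S * W.transpose();
}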

4.4.3 Advantages and Disadvantages

The EKF-SLAM solution inherits many of the standard EKF benefits and problems. The most important advantages of this method are its simplicity and its proven record of working well in practice.

The problems of EKF-SLAM arise mostly for two reasons. First, EKF-SLAM represents the state estimate by a mean and its uncertainty by a covariance matrix, and because we obtain


these two moments through linearization, they are only approximations and not the true moments of the distribution. Second, we assumed that the map distribution is a uni-modal Gaussian, so that it can be represented by its first two moments, but in fact this distribution is not Gaussian [9].

Consistency

Bailey et al. claimed in [9] that the most important source of inconsistency in EKF-SLAM is the heading uncertainty (variance). They examined this hypothesis in different scenarios (moving and stationary vehicle) and with different heading uncertainties, and concluded that with a relatively small heading uncertainty (less than 1.7 degrees) the inconsistency is time dependent and its effect can be weakened by adding some stabilising noise. For a large variance, however, failure occurs after a few updates and cannot be prevented by adding noise or by any other method.

Julier and Uhlmann, on the other hand, showed in [37] that the map built by the full-covariance SLAM algorithm is always inconsistent, even in the case of a stationary vehicle with no process (motion) noise. The reason earlier studies did not report this is that the inconsistency appears only after thousands of updates, whereas past works simulated only the first few hundred steps, in which the divergence is not noticeable.

So, inconsistency becomes evident when either the ground truth is available or the loop becomes large (e.g. greater than 100 m), and it can have an even larger effect on divergence than the computational complexity problem [26].

Convergence

Huang and Dissanayake showed that the robot heading error plays an important role in the limit, or lower bound, of the uncertainty of the landmark location estimates, and therefore has a significant effect on the convergence analysis of EKF-SLAM. In [13] they stated and proved the following theorems for the convergence of 2D EKF-SLAM, as well as for finding the limit of the covariance matrices and the limit of the inconsistency:

Theorem 1: The determinant of any submatrix of the map covariance matrix decreases monotonically as successive observations are made.


Theorem 2: If the robot remains stationary and observes a new landmark many times, the robot uncertainty remains unchanged.

Theorem 3: If the robot remains stationary and observes two new landmarks many times, the robot uncertainty remains unchanged.

Theorem 4: In EKF-SLAM, if the robot remains stationary at a point A and observes a new landmark k times, inconsistency may occur because the Jacobians are evaluated at different estimates. When k → ∞, the inconsistency may cause the robot orientation error to be reduced to zero.

Durrant-Whyte and Bailey showed in [14] that the map will converge if the map covariance matrix and all landmark-pair submatrices converge monotonically to zero.

The individual landmark variances inherit the initial uncertainty of the robot position and observations, and converge monotonically to a lower bound, as shown in Figure 4.2 from [14].


Figure 4.2: The convergence in landmark uncertainty. The plot shows a time history of standard deviations of a set of landmark locations. A landmark is initially observed with uncertainty inherited from the robot location and observation. Over time, the standard deviations reduce monotonically to a lower bound. New landmarks are acquired during motion.


In short, they claimed that convergence and consistency can only be guaranteed in the linear case.

Loop Closure

As the UAV moves inside the environment and explores the map around it, the localization error grows, and therefore the uncertainty in the UAV position increases because of this accumulated error. As a direct result, the map gradually diverges. To overcome this divergence, the UAV must be able to recognize positions it has visited before and use the new observations of them to correct the map estimate and reduce the uncertainty of the map and of its pose. This process in SLAM is known as the loop closure step. Gutmann and Konolige [18] illustrated the result of loop closing (see Figure 4.3).

Since the map grows as new landmarks are added in each iteration, and since online processing requires each step to be computed quickly for near real-time execution, all computations must be performed at low cost. Hence closing the loop must also be cheap, and in the ideal case it does not require a separate algorithm.

EKF-based SLAM, however, needs a separate algorithm to identify loops, and closing the loop requires a very expensive computation of order O(N²), as shown in [23]. So, loop closure remains one of the main unsolved issues with EKF-SLAM.


Figure 4.3: Before and after closing the loop, this figure is from [19]


Data Association

Data association is the process of relating the features observed through the sensor (the camera in our case) to the features in the existing map. If we associate these observations incorrectly, the filter update will be based on incorrect data, and if this incorrect association happens for more and more features, the EKF will diverge and cannot be corrected [25].

Incorrect data association can occur if we use an inaccurate or imprecise sensor model, if the measurements (observations) are too noisy, or simply when the features are not stationary in the environment, as we previously assumed, and can move.

Kalman filter based SLAM cannot cope with the incorrect data association problem and will diverge.

Computational Complexity

Most of the computational cost of EKF-SLAM comes from computing the Kalman gain in the observation update step. Because of the matrix inversion, its cost is of order O(N²). Hence, the EKF update procedure needs a quadratic number of operations and its computational complexity is O(N²), where N is the number of features in the map. Consequently, EKF-SLAM cannot be used for large maps (i.e. large N).

Uni-modal Gaussian

Because the map is assumed to have a uni-modal Gaussian probability distribution in all cases, missions like the one in Figure ?? cannot be completed. To solve this problem the EKF approach must be replaced with other methods.

Nonlinearity

In the EKF, nonlinear functions are replaced by their linear approximations, so the filter can diverge if the nonlinearity of the functions is large.


Chapter 5

Experiments

The main focus of this research is to find the correct pose of an autonomously flying quad rotor in a room, and to use this information to better control the device.

The quad rotor we used was equipped with a single onboard camera, and MonoSLAM, with the onboard camera as sensor, was used as the pose finder. MonoSLAM is described in Section 5.1. Davison's C++ library was used to implement MonoSLAM.1

The ground truth of the quad rotor pose was checked with the VICON cameras installed in the robust control lab; a brief overview of VICON is presented in Section 5.2.

In order to simulate the flight of the UAV, we used a handheld camera. The results of the experiments are presented in Section 5.3.

5.1 MonoSLAM

This section presents the results of Davison's work on the MonoSLAM problem [12]. In MonoSLAM, the main aim of the process is finding the robot position (localization) rather than building a complete map of the room, and the robot adds only as many features to the map as are necessary to find its position in the unknown world around it. The output map is therefore a sparse set of high-quality landmarks, containing around 100 features for a limited space in the room. Davison assumed the scene is rigid, and hence that the features inside the scene are stationary.

1The latest version of this software is available at his website: http://www.doc.ic.ac.uk/~ajd/software.html


The camera state vector x_v is a 13-parameter vector consisting of a 3-element vector r^W representing the 3D metric position, a 4-element vector q^{WR} representing the orientation quaternion, and 3-element vectors v^W and ω^R representing the velocity and angular velocity respectively. The camera is modelled as a rigid body whose pose is completely described by its translation and rotation:

x_v = \begin{bmatrix} r^W \\ q^{WR} \\ v^W \\ \omega^R \end{bmatrix},

where W indicates the "world frame" and R the "robot frame".

The map of the world, x, consists of the camera state vector (x_v) and the feature positions (y_i). The uncertainty of this map is represented by the covariance matrix P:

x = \begin{bmatrix} x_v \\ y_1 \\ y_2 \\ \vdots \end{bmatrix}, \qquad
P = \begin{bmatrix}
P_{xx} & P_{xy_1} & P_{xy_2} & \cdots \\
P_{y_1 x} & P_{y_1 y_1} & P_{y_1 y_2} & \cdots \\
P_{y_2 x} & P_{y_2 y_1} & P_{y_2 y_2} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
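A minimal C++ sketch of how such a state could be laid out is given below; this is a hypothetical data structure for illustration, not Davison's SceneLib types.

#include <array>
#include <vector>

// 13-parameter camera state: position, orientation quaternion,
// linear velocity and angular velocity.
struct CameraState {
    std::array<double, 3> rW;    // 3D position in the world frame
    std::array<double, 4> qWR;   // orientation quaternion
    std::array<double, 3> vW;    // linear velocity in the world frame
    std::array<double, 3> wR;    // angular velocity in the robot frame
};

// Full MonoSLAM map: camera state, one 3D position per feature, and a single
// dense covariance matrix over all of them (stored row-major here).
struct MonoSlamMap {
    CameraState xv;
    std::vector<std::array<double, 3>> features;  // y_1, y_2, ...
    std::vector<double> P;                        // (13 + 3N) x (13 + 3N) covariance
};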

The map is made up of image patches with their corresponding feature positions (means) and an elliptical uncertainty bound (standard deviation) around each feature in 3D space.

If the map has N features, the size of the map becomes O(N²) because of the size of the matrix P, and the MonoSLAM complexity is also O(N²). For real-time processing at 30 Hz, the possible number of features in the map is around 100, and Davison showed in [11] that 100 well-chosen features are sufficient to span the limited volume of the room in which the experiment takes place.

As mentioned in the last chapter, nearby features (which can be observed at the same time) are related to each other with a good approximation, but the position of the group as a whole has a large


uncertainty. This means that the elliptical bounds mentioned above are correlated, which explains the non-zero off-diagonal elements of the covariance matrix.

MonoSLAM is based on the repeated observation of features from different viewpoints. Various subsets of features at different depths are covisible as the camera moves freely inside the limited 3D space, and loop closures of different sizes occur; therefore, knowing the detailed correlations between the map's subsets is very important for solving the loop closure problem. Davison et al. showed that, among the known methods, the "standard single, full covariance EKF approach" is the most computationally feasible method that can achieve the goal of "long term, repeatable localization within restricted volumes".

Davison and Maury used large image patches (11x11 pixels) as the long-term landmarks, and used the Shi and Tomasi feature detection approach to find the features. Instead of using plain template matching to recognize the features after re-observation, they assumed each landmark lies on a locally planar surface and approximated this surface's normal by the vector oriented from the feature to the camera. Because the real orientation is unknown at initialization, each feature is stored as an oriented planar texture once it has been fully initialized. In order to detect an initialized feature from a different viewpoint when the camera pose changes, they projected the image patch to the predicted place where they expected to find the feature (using the new camera 3D position and orientation), generated the new template, and compared it with the image at each candidate location until a peak was found (Figure 5.1).

In order to update the normal of each feature's plane in the 3D map, as the camera moves and a new estimate of its pose becomes available, the warped version of the initial image patch is computed using the following formula:

H = C R \,[\, n^T x_p I - t\, n^T \,]\, C^{-1},

where:
C = the camera's calibration matrix (perspective projection),
R = the camera rotation matrix,
t = the camera translation vector,
n = the surface normal,
x_p = the projection of the patch's center,
I = the 3×3 identity matrix.


Figure 5.1: The evolution of the depth probability density from a uniform distribution to a Gaussian; points are uniformly distributed along the image patch's normal line from 0.5 to 5 m. Courtesy of [12].


Figure 5.2: The camera poses in 2 different viewpoints looking at the same imagepatch with normal n. Courtesy of [12].


The warped patch corresponds to the feature's predicted appearance from the current viewpoint. The difference


between this predicted appearance and the measured one is used to update the feature's surface normal direction. Davison assumed in [12] that the feature's plane normal direction is only weakly correlated with the camera and feature positions, and therefore maintained a separate 2-parameter EKF for estimating the normal direction of each feature.
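As an illustration, the following C++ sketch builds H exactly as the formula above is written, treating n^T x_p as a scalar product of two 3-vectors and C as a standard calibration matrix with a single focal length f and principal point (u_0, v_0); these are assumptions made for the sketch, not details taken from [12].

#include <array>

using Mat3 = std::array<std::array<double, 3>, 3>;
using Vec3 = std::array<double, 3>;

static Mat3 mul(const Mat3& a, const Mat3& b) {
    Mat3 c{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k) c[i][j] += a[i][k] * b[k][j];
    return c;
}

// H = C R [ (n^T x_p) I - t n^T ] C^{-1}
Mat3 patchWarp(double f, double u0, double v0,     // calibration (square pixels assumed)
               const Mat3& R, const Vec3& t,       // camera rotation and translation
               const Vec3& n, const Vec3& xp)      // surface normal and patch centre
{
    // Calibration matrix and its closed-form inverse (upper-triangular form).
    Mat3 C{}, Cinv{};
    C[0][0] = f;        C[0][2] = u0;
    C[1][1] = f;        C[1][2] = v0;
    C[2][2] = 1.0;
    Cinv[0][0] = 1.0 / f;  Cinv[0][2] = -u0 / f;
    Cinv[1][1] = 1.0 / f;  Cinv[1][2] = -v0 / f;
    Cinv[2][2] = 1.0;

    double s = n[0] * xp[0] + n[1] * xp[1] + n[2] * xp[2];   // n^T x_p

    Mat3 M{};                                                // (n^T x_p) I - t n^T
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            M[i][j] = (i == j ? s : 0.0) - t[i] * n[j];

    return mul(mul(C, R), mul(M, Cinv));
}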

To help the system assign a precise scale to the estimated map, and to help it estimate the camera motion with a good approximation in the starting frames, the map is initialized with 4 features (for example, the corners of a letter-sized sheet of paper) with known positions and known appearance. The initial position of the camera in the first frame (the viewpoint from which the initialized features are stored) is also known.

Gathering several measurements from different viewpoints is required to estimate a feature's depth. To find the depth, a semi-infinite 3D line is defined after initializing the feature, starting from the camera position, passing through the feature point, and going to infinity. The true location of the feature lies on this line, and the Gaussian uncertainty of this ray is shown by an elliptical shape around it. The line is represented by the vector y_i^p as follows, where r_i^W is the position of the camera and h_i^W is a vector describing the ray's orientation:

y_i^p = \begin{bmatrix} r_i^W \\ h_i^W \end{bmatrix}

To determine whether a recognized feature is a new one (so that it can be initialized) or an existing one (so that the knowledge about its depth can be updated), a set of uniformly distributed depth hypotheses is defined along this semi-infinite line; the image patch (stored in the initialization step) is projected at each discrete location to make new image templates, and the observed feature is compared with them using a 2D template matching technique. The hypothesis corresponding to feature y_i at depth λ has the location y_{λi} = r_i^W + λ h_i^W, and its Gaussian uncertainty is shown by an ellipse around this point. Therefore, at the beginning, each feature has a set of ellipses uniformly distributed along a line with their centers lying on that line. After re-observing a feature, the likelihood of each hypothesis along the line is evaluated and the probability of each hypothesis is recalculated, until the correct depth is found.

The robot must decide when a new feature can be added to the map and when the


existing feature can be deleted from it while the UAV is flying inside the room and grabbing new frames. Davison showed that it is sufficient to have around 12 visible features in each frame for accurate localization. A new feature is added if the number of visible features drops below this threshold, and a feature is deleted from the map if more than 50% of the attempts to detect it over a predetermined number of frames fail. Since the image patches are relatively large, it is very rare for a true feature and an occluding object to be confused because of appearance similarity.

5.1.1 Motion

As mentioned in Chapter 3, the SLAM algorithm consists of two steps: prediction and update. For the MonoSLAM algorithm these two steps are as follows:

1) prediction step:

Using the standard pinhole model for the camera, the image location (u, v) at which we expect to find the feature is given by:

h_i = \begin{bmatrix} u \\ v \end{bmatrix} =
\begin{bmatrix}
u_0 - f k_u \, h^R_{Lx} / h^R_{Lz} \\
v_0 - f k_v \, h^R_{Ly} / h^R_{Lz}
\end{bmatrix},

where (f, k_u, k_v, u_0, v_0) are the camera calibration parameters and h^R_L is the expected feature position relative to the camera, expressed in the robot's frame:

h^R_L = R^{RW} (y^W_i - r^W).
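A small C++ sketch of this measurement prediction is given below; the rotation matrix R^{RW} and the calibration parameters are passed in as plain arguments, and the code is illustrative rather than taken from the MonoSLAM library.

#include <array>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

struct Pixel { double u, v; };

// Predict the image location of feature y^W seen from camera position r^W
// with rotation R^{RW} (world -> robot), using the pinhole model above.
Pixel projectFeature(const Mat3& R_RW, const Vec3& rW, const Vec3& yW,
                     double f, double ku, double kv, double u0, double v0)
{
    // h_L^R = R^{RW} (y^W - r^W): feature position relative to the camera.
    Vec3 d = {yW[0] - rW[0], yW[1] - rW[1], yW[2] - rW[2]};
    Vec3 hL{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) hL[i] += R_RW[i][j] * d[j];

    Pixel p;
    p.u = u0 - f * ku * hL[0] / hL[2];
    p.v = v0 - f * kv * hL[1] / hL[2];
    return p;
}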

2) update step:

The camera motion model Davison used in this project is a constant-velocity, constant-angular-velocity model, which assumes that the unknown accelerations occur with a Gaussian profile and cause an impulse of velocity (V^W) and angular velocity (Ω^R) at each step:

n = \begin{bmatrix} V^W \\ \Omega^R \end{bmatrix} = \begin{bmatrix} a^W \Delta t \\ \alpha^R \Delta t \end{bmatrix}


Figure 5.3: Camera Motion Model

Using the above motion model, the new state estimate f_v(x_v, u) and its uncertainty Q_v can be computed as follows:

f_v = \begin{bmatrix} r^W_{new} \\ q^{WR}_{new} \\ v^W_{new} \\ \omega^R_{new} \end{bmatrix} =
\begin{bmatrix}
r^W + (v^W + V^W)\Delta t \\
q^{WR} \times q\big((\omega^R + \Omega^R)\Delta t\big) \\
v^W + V^W \\
\omega^R + \Omega^R
\end{bmatrix},
\qquad
Q_v = \frac{\partial f_v}{\partial n} P_n \left(\frac{\partial f_v}{\partial n}\right)^T,

where q((ω^R + Ω^R)∆t) is the quaternion defined by the rotation vector (ω^R + Ω^R)∆t, n is the noise vector, and P_n is its diagonal covariance matrix, representing uncorrelated noise between the components. The rate of growth of the motion uncertainty is determined by the size of P_n, the covariance of the noise vector, and the assumed smoothness of the motion is defined by its parameter values: a small P_n corresponds to smooth motion with small accelerations (which cannot follow rapid movements), while a large P_n can cope with such motions but requires many good measurements at each step.
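A minimal C++ sketch of one step of this motion model is given below; the quaternion convention (w, x, y, z) and the handling of the impulse inputs are assumptions made for illustration.

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Quat = std::array<double, 4>;   // (w, x, y, z)

// Quaternion corresponding to the rotation vector w*dt (axis-angle form).
static Quat fromRotationVector(const Vec3& w, double dt) {
    double angle = std::sqrt(w[0]*w[0] + w[1]*w[1] + w[2]*w[2]) * dt;
    if (angle < 1e-12) return {1.0, 0.0, 0.0, 0.0};
    double s = std::sin(angle / 2.0) / (angle / dt);   // = sin(angle/2)/|w|
    return {std::cos(angle / 2.0), w[0]*s, w[1]*s, w[2]*s};
}

// Hamilton product of two quaternions.
static Quat mulQuat(const Quat& a, const Quat& b) {
    return {a[0]*b[0] - a[1]*b[1] - a[2]*b[2] - a[3]*b[3],
            a[0]*b[1] + a[1]*b[0] + a[2]*b[3] - a[3]*b[2],
            a[0]*b[2] - a[1]*b[3] + a[2]*b[0] + a[3]*b[1],
            a[0]*b[3] + a[1]*b[2] - a[2]*b[1] + a[3]*b[0]};
}

struct CamState { Vec3 rW; Quat qWR; Vec3 vW; Vec3 wR; };

// f_v: constant-velocity, constant-angular-velocity prediction with velocity
// impulse VW and angular-velocity impulse OR applied over the step dt.
CamState predict(const CamState& s, const Vec3& VW, const Vec3& OR, double dt) {
    CamState n = s;
    for (int i = 0; i < 3; ++i) {
        n.vW[i] = s.vW[i] + VW[i];
        n.wR[i] = s.wR[i] + OR[i];
        n.rW[i] = s.rW[i] + n.vW[i] * dt;     // r + (v + V) dt
    }
    n.qWR = mulQuat(s.qWR, fromRotationVector(n.wR, dt));
    return n;
}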


5.2 Vicon

The Vicon system is designed for optical motion capture: it tracks infrared-reflective markers attached to the object of interest, which reflect the infrared light emitted by a ring of infrared LEDs around the lens of each camera.

The Vicon system consists of 4 main parts:

1. MX cameras: the system includes 8 high-resolution, high-speed cameras installed around the room. Each camera outputs the raw IR points reflected by the markers to the Ultranet system, as shown in Figure 5.4.

Figure 5.4: Vicon camera.

2. Ultranet MX: links the MX cameras to the host computer.

3. Host computer.

4. Infrared-reflective markers: available in 4 different sizes, as shown in Figure 5.5. The key point in using the Vicon is placing these markers so that the complete motion can be recovered from their positions. Because the camera is a rigid body without any joints, the model we used for it is a simple box, with a marker on the center of each side, as shown in Figure [].

By fitting the box model to the camera motion, its exact pose is found and used to check the ground truth of the MonoSLAM algorithm output.


Figure 5.5: Vicon markers, from [3].

5.3 Experimental Results

We used a Panasonic PV-GS500 camera to run the experiment and the Vicon cameras to check the ground truth of the MonoSLAM program. Camera calibration was performed using standard software and a calibration grid [1], and the following parameters were obtained:

Focal length: fku = fkv = 941.5 pixels;

Principal point: (u0, v0) = (426.0, 239.5);

Distortion factor: K1 = 0.001;

Image size: 853× 480.

The experiments were done inside the robust control lab, and data sequences were collected while the camera was looking down at the floor and moving within a restricted volume inside the room.

In our setup, several challenges made the task harder. Usually there are not enough objects on the floor, and the features are mostly tile corners, which are very similar to each other and can lead the system to an incorrect loop closure. Also, because we wanted to move the camera inside the space covered by the Vicon cameras, it was not possible to move the camera very high, so using the objects located on the desks and tables was not possible either. We therefore made a cluttered space full of objects on the floor to help the system.

Camera localization was performed using Davison's open-source MonoSLAM library, and the ground truth was checked with the Vicon cameras. The output of MonoSLAM at some discrete times is shown in Figure 5.6; each panel consists of the map of the room and the detected features for the corresponding frame. The camera trajectory for a sample run is shown in Figure 5.9, and the errors in the x, y and z dimensions are shown in Figure 5.7.

The output of the Vicon cameras also has an error, with an average standard deviation of (0.0194, 0.0154, 0.0261) in the x, y and z dimensions respectively in this experiment; therefore the error illustrated in Figure 5.7 is not due to the MonoSLAM program alone. We can see that the


difference between the output of MonoSLAM and the output of Vicon has a large peak at the beginning and decreases as the program continues.


Figure 5.6: The output of MonoSLAM for a sample sequence, with 4 startup features. The map is built gradually by updating the visible features in each frame. Blue marks correspond to failed matches, red marks to successful matches, and yellow marks to features not used in the current frame.


(a) Robot localization error in the x dimension: x(Vicon) − x(MonoSLAM) versus iteration.

(b) Robot localization error in the y dimension: y(Vicon) − y(MonoSLAM) versus iteration.

(c) Robot localization error in the z dimension: z(Vicon) − z(MonoSLAM) versus iteration.

Figure 5.7: MonoSLAM error in the x, y, and z dimensions.

(a) Orientation in x-y plane (b) Orientation in y-z plane (c) Orientation in z-x plane

Figure 5.8: UAV orientation, MonoSLAM and Vicon outputs.


Figure 5.9: Comparing the output trajectories from Vicon and MonoSLAM in the x-y plane.


Chapter 6

Conclusions

In this research, we studied different approaches for estimating the pose of a quadrotor flying inside a room. The disadvantages of each approach in real-life situations were explained, and Davison's MonoSLAM was finally chosen for the localization step of the "Vision-Based UAV Control" research.

This is the starting point of our UAV control approach, and we showed that MonoSLAM can generate results reliable enough to be used in the control step.


Chapter 7

Discussion and Future Works

7.1 Discussion

7.1.1 Which Feature to Add

Davison relied on finding high-quality landmarks and added a new one whenever the number of observable features in a frame fell below a predefined threshold (e.g., 12).

It is possible for the highest-quality features in an environment to lie close to each other. This crowds most of the landmarks into one part of the map and leaves no information in other parts. Davison's requirement of 12 visible landmarks per frame helps to mitigate this, but it is better to select features spread evenly around the room; this also makes it easier to run MonoSLAM in larger environments.

A reasonable solution is, when adding a new landmark to the map, not to search for it within a defined neighborhood of the existing landmarks (Figure 7.1). This prevents the program from picking nearby high-quality features, which add little useful information to the localization problem; a sketch of this selection rule is given below.
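A minimal sketch of this selection rule (not Davison's code); the quality scores, landmark positions, and the minimum-distance threshold are assumptions made only for the example.

import numpy as np

def select_new_landmark(candidates, scores, map_landmarks, min_dist=0.5):
    """Return the best-quality candidate that lies at least min_dist away
    from every landmark already in the map, or None if none qualifies."""
    for i in np.argsort(scores)[::-1]:             # best quality first
        c = candidates[i]
        if (len(map_landmarks) == 0 or
                np.linalg.norm(map_landmarks - c, axis=1).min() >= min_dist):
            return c
    return None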

7.1.2 Moving Objects

When the SLAM algorithm is used for localization, once the complete map has been built, occlusion, acceleration, and other unexpected movements can be handled by the system and do not corrupt the pose estimation process.



Figure 7.1: (a) Existing map of the room; 10-point stars are the features, with darker colors corresponding to high-quality features and lighter colors to low-quality ones; the 4-point pink stars are the selected landmarks. (b) Updated map after selecting two more landmarks based only on feature quality. (c) Updated map after selecting features based on both quality and distance from the existing landmarks in the map.

Davison assumed that the initial positions of four features with known appearance, and the initial pose of the camera (robot), are given at the starting frame. If, unlike Davison, we assume that a high-resolution picture of the environment (or some part of it) is available [21] with no additional information, it is possible to use classification trees [7], [22]. This can solve the problem of moving objects inside the map (by giving an initial overview of the map at the starting point), and it can also supply the SLAM algorithm with extra localization information [7]. It can further help the template-matching step by providing more templates for each landmark (similar to what Davison did in [12]).
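A rough sketch of the classification-tree idea (not the implementations of [7] or [22]): patches around keypoints of the prior high-resolution reference image are warped to synthesize training views, and a tree classifier then labels patches from a live frame with the keypoint they came from. The library calls, patch size, and thresholds below are illustrative assumptions.

import numpy as np
import cv2
from sklearn.tree import DecisionTreeClassifier

PATCH = 16      # patch size in pixels (assumed)
N_VIEWS = 50    # synthetic training views per keypoint (assumed)

def normalized_patch(gray, x, y):
    h = PATCH // 2
    p = gray[int(y) - h:int(y) + h, int(x) - h:int(x) + h].astype(np.float32)
    if p.shape != (PATCH, PATCH):
        return None                              # too close to the image border
    return ((p - p.mean()) / (p.std() + 1e-6)).ravel()

def train_keypoint_tree(reference_gray):
    """Train a tree that maps an image patch to a reference keypoint id."""
    corners = cv2.goodFeaturesToTrack(reference_gray, 50, 0.01, 20).reshape(-1, 2)
    X, y = [], []
    for label, (cx, cy) in enumerate(corners):
        for _ in range(N_VIEWS):
            # Random in-plane rotation and scale about the keypoint.
            M = cv2.getRotationMatrix2D((float(cx), float(cy)),
                                        np.random.uniform(-30, 30),
                                        np.random.uniform(0.8, 1.2))
            warped = cv2.warpAffine(reference_gray, M,
                                    (reference_gray.shape[1], reference_gray.shape[0]))
            feat = normalized_patch(warped, cx, cy)
            if feat is not None:
                X.append(feat)
                y.append(label)
    return DecisionTreeClassifier(max_depth=15).fit(np.array(X), np.array(y)), corners

# At run time, normalized patches around corners detected in the live frame are
# passed to the trained tree's predict(...) to obtain tentative correspondences
# with the reference keypoints, which can seed or correct the SLAM map.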

7.2 Future Works

The next step of this research is autonomous control of the quadrotor inside the room. We will also try to incorporate the solutions mentioned in the discussion above to improve the UAV localization results.


Bibliography

[1] Camera calibration toolbox. http://www.vision.caltech.edu/bouguetj/calib_doc/.

[2] Monte Carlo simulation. http://www.vertex42.com/ExcelArticles/mc/MonteCarloSimulation.html.

[3] Vicon introduction. http://grouplab.cpsc.ucalgary.ca/cookbook/index.php/Toolkits/ViconIntroduction.

[4] Wikipedia. http://en.wikipedia.org/wiki/Helicopter.

[5] E. Altug and C. Taylor. Vision-based pose estimation and control of a model helicopter. In Proceedings of the IEEE International Conference on Mechatronics (ICM '04), pages 316–321, June 2004.

[6] Andrew J. Davison, Yolanda Gonzalez Cid, and Nobuyuki Kita. Real-time 3D SLAM with wide-angle vision. 2004.

[7] Aurelien Boffy, Yanghai Tsin, and Yakup Genc. Real-time feature matching using adaptive and spatially distributed classification trees. In BMVC 2006, September 2006.

[8] T. Bailey and H. Durrant-Whyte. Simultaneous localization and mapping (SLAM): Part II. IEEE Robotics and Automation Magazine, 13(3):108–117, September 2006.

[9] T. Bailey, J. Nieto, J. Guivant, M. Stevens, and E. Nebot. Consistency of the EKF-SLAM algorithm. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3562–3568, October 2006.


[10] A. Davison. Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the Ninth International Conference on Computer Vision (ICCV '03), October 2003.

[11] A. Davison, W. Mayol, and D. Murray. Real-time localisation and mapping with wearable active vision. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, 2003.

[12] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, June 2007.

[13] Shoudong Huang and Gamini Dissanayake. Convergence analysis for extended Kalman filter based SLAM. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), pages 412–417, May 2006.

[14] H. Durrant-Whyte and T. Bailey. Simultaneous localization and mapping: Part I. IEEE Robotics and Automation Magazine, 13(2):99–110, June 2006.

[15] Emin Gabrielyan. The basics of line moiré patterns and optical speedup. 2007.

[16] Glenn P. Tournier, Mario Valenti, and Jonathan P. How. Estimation and control of a quadrotor vehicle using monocular vision and moiré patterns. In AIAA Guidance, Navigation, and Control Conference and Exhibit, August 2006.

[17] Greg Welch and Gary Bishop. An introduction to the Kalman filter. 2001.

[18] J. Gutmann and K. Konolige. Incremental mapping of large cyclic environments. In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), pages 318–325, November 1999.

[19] J. Gutmann and K. Konolige. Incremental mapping of large cyclic environments. In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), pages 318–325, November 1999.

[20] D. Heckerman. A tutorial on learning with Bayesian networks. 1995.

[21] M. V. Joshi, S. Chaudhuri, and R. Panuganti. A learning-based method for image super-resolution from zoomed observations. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35:527–537, June 2005.


[22] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 2, pages 775–781, June 2005.

[23] A. Martinelli and R. Siegwart. Exploiting the information at the loop closure in SLAM. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, pages 2055–2060, April 2007.

[24] Kevin P. Murphy. Dynamic Bayesian networks. 12 November 2002.

[25] J. Neira and J. Tardos. Data association in stochastic mapping using the joint compatibility test. IEEE Transactions on Robotics and Automation, 17(6):890–897, December 2001.

[26] José Neira. Multivehicle mapping of large environments. IROS 2005 Advanced Tutorial in SLAM, August 2005.

[27] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 3rd edition, 1991.

[28] Randall Smith, Matthew Self, and Peter Cheeseman. A stochastic map for uncertain spatial relationships. 1988.

[29] Randall C. Smith and Peter Cheeseman. On the representation and estimation of spatial uncertainty. The International Journal of Robotics Research, 5(4):56–68, 1986.

[30] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, October 2000.

[31] Jianbo Shi and Carlo Tomasi. Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '94), Seattle, June 1994.

[32] V. Smelyanskiy, P. Cheeseman, D. Maluf, and R. Morris. Bayesian super-resolved surface reconstruction from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 375–382, June 2000.

[33] Doron Tal. 3D model-based SLAM from 2D images. Fall 2003.

[34] Charles W. Therrien. Discrete Random Signals and Statistical Signal Processing. Prentice Hall, 1992.


[35] T. Tommasini, A. Fusiello, V. Roberto, and E. Trucco. Robust feature tracking in underwater video sequences. In Proceedings of IEEE Oceans, pages 46–50, September 1998.

[36] Glenn P. Tournier. Six degrees of freedom estimation using monocular vision and moiré patterns. 2006.

[37] Simon J. Julier and Jeffrey K. Uhlmann. A counter example to the theory of simultaneous localization and map building. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 4238–4243, 2001.

[38] Nicholas Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 1949.

[39] V. N. Smelyanskiy, P. Cheeseman, D. A. Maluf, and R. D. Morris. Bayesian super-resolved surface reconstruction from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 375–382, June 2000.