
Imperial College of Science, Technology and Medicine

Department of Electrical and Electronic Engineering

Custom hardware architectures for embedded high-performance and low-power SLAM

Konstantinos Boikos

Supervised by Christos-Savvas Bouganis

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Electrical and Electronic Engineering of Imperial College and

the Diploma of Imperial College, April 2019


Abstract

Simultaneous localisation and mapping (SLAM) is central to many emerging applications such

as autonomous robotics and augmented reality. These require an accurate and information

rich reconstruction of the environment which is not provided by the current state-of-the-art in

embedded SLAM which focuses on sparse, feature-based methods. SLAM needs to be performed

in real time, with low latency. At the same time, dense SLAM that can provide a high

level of reconstruction quality and completeness comes with high computational and power

requirements, while platforms in the embedded space often come with significant power and

weight constraints.

Towards overcoming this challenge, this thesis presents FPGA-based custom hardware archi-

tectures that offer significantly higher performance than general purpose embedded hardware

for SLAM, but with the same low-power requirements. The work begins by discussing the

characteristics and computational patterns of this type of application, focusing on a state-of-

the-art semi-dense direct SLAM algorithm. Then custom hardware architectures are presented

and evaluated as they emerged from this research work. These combine many novel features to

achieve performance on par with optimised software on a high-end multicore desktop CPU

but with more than an order-of-magnitude better performance-per-watt.

The two high-performance, power-efficient architectures for the two interdependent tasks that

comprise the core of real-time SLAM are designed to work alongside a mobile CPU running

a full operating system, and scale in terms of resources to provide a solution that can be

adapted to most off-the-shelf FPGA-SoCs. Thus, as well as offering the necessary performance

and performance-per-watt to enable advanced semi-dense SLAM on mobile power-constrained

platforms, they stand to bridge the gap between custom hardware and research in algorithms

and robotic vision as they can be adapted and re-used more easily than traditional custom

hardware architectures.


Acknowledgements

Firstly, I would like to thank my supervisor, Dr. Christos Bouganis, for his trust, guidance

and support during these years at Imperial College. Our meetings and discussions allowed me

to develop my abilities and confidence as a researcher, helped me develop my critical thinking

and most importantly helped me get a new perspective all the times that this journey towards

a PhD felt too complicated and overwhelming.

I would also like to thank everyone in the Circuits and Systems group for all the interesting

conversations in the lab, for introducing me to topics I would not have discovered alone and

for their feedback on my work on multiple occasions.

A special thanks to Stelios and Alexandros. Without our conversations, technical and not,

ideas shared, and everything else inside and outside the lab this PhD really would not have

been the same.

I would like to express my gratitude to Nikos, Christos, Rafaella and all the other amazing

people from back home, for always believing in me and supporting me. To Miriam; thank you

for everything these years. There is too much to fit in this page. Without your support and

love I would not be where I am now.

Furthermore, I would like to express my love and gratitude to my grandparents Konstantinos,

Voula and Kalliopi for bringing me up, for all their love and for teaching me all those things

about life that grandparents are always better at knowing. Finally, to my parents Nikos and

Fotini; my deepest love and gratitude for always being there and supporting me, emotionally

and practically, in every way you could, and for your love throughout my life.


Declarations

Declaration of Originality

I herewith certify that the work presented in this thesis is my own original work. All material in this

thesis which is not my own work has been appropriately referenced.

Declaration of Copyright

The copyright of this thesis rests with the author and is made available under a Creative

Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy,

distribute or transmit the thesis on the condition that they attribute it, that they do not use it

for commercial purposes and that they do not alter, transform or build upon it. For any reuse

or redistribution, researchers must make clear to others the licence terms of this work.


‘It is well known that a vital ingredient of success is not knowing that what you’re attempting can’t be done’

Sir Terry Pratchett


Contents

Abstract

Acknowledgements

Declarations

1 Introduction

1.1 Machine Vision

1.2 Terminology and concepts in the field of SLAM

1.2.1 Input data

1.2.2 Output

1.2.3 SLAM operation and quality metrics

1.2.4 Independent variables for SLAM and their effect

1.3 Motivation

1.4 Research Question

1.5 Aims and Thesis Overview

1.6 Research Contributions and Statement of Originality

1.7 Publications


2 Background

2.1 A brief history of SLAM

2.2 Principles of state-of-the-art SLAM

2.3 Direct semi-dense SLAM

2.4 Algorithmic overview of LSD-SLAM

2.5 Tracking in LSD-SLAM

2.6 Mapping in LSD-SLAM

2.7 Proposed Architectures and FPGA-SoCs

2.8 Related Work

3 Accelerating semi-dense SLAM

3.1 Motivation

3.2 Tracking Algorithm in LSD-SLAM

3.2.1 Tracking in LSD-SLAM

3.2.2 The tracking algorithm

3.2.3 Mapping and Global Optimisation

3.3 Profiling and Performance Analysis

3.3.1 Profiling Tools

3.3.2 Timing Results

3.4 Accelerator architecture

3.4.1 System architecture and control

3.4.2 Residual and Weight Calculation Unit

3.4.3 Linear System Generation - Jacobian Update Unit


3.5 Evaluation

3.6 Conclusion

4 Accelerating Tracking for SLAM

4.1 Motivation

4.2 Architecture

4.2.1 System Architecture

4.2.2 Direct Tracking Core

4.2.3 Frame Cache Partitioning

4.3 Evaluation

4.3.1 Custom Core Performance

4.3.2 Resource Scaling

4.3.3 Running as part of a SLAM Pipeline

4.4 Performance Analysis

4.5 Conclusions

5 Accelerating Mapping for SLAM

5.1 Motivation

5.2 Mapping Algorithm - LSD-SLAM

5.3 Architecture of the Mapping Accelerator

5.3.1 Coprocessor architecture and FPGA-SoCs

5.3.2 High-level Architecture Overview and Functionality

5.3.3 Multi-rate dataflow operation

5.3.4 Performance Analysis

5.4 Evaluation

5.4.1 Experimental Setup

5.4.2 Benchmark selection and Platforms

5.4.3 Design Implementation and Resource Usage

5.4.4 Performance and Power Comparison

5.5 Conclusions

5.5.1 Achievements of Thesis

6 Conclusions and Future Work

6.1 Lessons learnt designing with HLS and FPGA-SoCs

6.2 Generalisation of the presented research

6.3 Research Conclusions

6.4 Future Work

Bibliography


List of Tables

2.1 State-of-the-art SLAM examples. Compiled with a focus on the features and characteristics of different solutions to demonstrate the breadth of the field, together with typical or reported power requirements where available. Comparisons are made at camera resolutions in the same region of megapixels.

3.1 Profiling Results - Callgrind / x86 Intel CPU

3.2 Timing Results - Intel i7-4770 @ 3.77 GHz

3.3 Timing Results - ARM Cortex-A9 @ 667 MHz

3.4 Datasets used, provided on TUM’s website by the authors of LSD-SLAM

3.5 FPGA Resources. The first two columns represent the resources post-synthesis, where the tool has allocated a certain number of resources to each instantiated hardware unit. Post-implementation, Vivado uses various optimisations to reduce usage by combining circuits or simplifying various units.

5.1 Resources post-implementation

5.2 State-of-the-art SLAM examples. Compiled with a focus on features and characteristics of different solutions to demonstrate the breadth of the field. This is a simpler version of the table in Chapter 2.


List of Figures

1.1 SLAM Continuum from Sparse to Dense. A more complex and globally consistent map significantly increases computation requirements.

1.2 In a moving platform or camera, the performance of SLAM will directly affect the accuracy of tracking and the accuracy and quality of the reconstruction, and in extreme cases lead to tracking loss. The map on the right is on a system that can deliver a 4× improvement in performance compared to the left. The map on the left has accumulated a very large error from skipping frames due to performance constraints.

1.3 Even if the slower system can keep up, improved performance is crucial to improve quality and accuracy under fast movement. The map on the left was recovered on a system with a performance deficit of 2× compared to the one on the right.

2.1 SLAM algorithmic overview. The main tasks are inside the two dashed rectangles. The main data structures are indicated with orange coloured boxes, while the light blue boxes represent the main tasks necessary to perform tracking and mapping. With grey we indicate some of the background tasks involved in full SLAM.

2.2 Figure adapted from [1] to demonstrate tracking with direct Keyframe alignment on sim(3) utilizing the estimated depth map. We can see from two camera views the current state of the map on the left as a collection of inverse depth and depth variance values for the mapped points and the photometric residual on the top right as a result of the reprojection.


2.3 Pyramid processing. Starting at a lower, coarser resolution increases the radius of convergence for the optimisation and improves the speed of reaching a minimum.

2.4 Epipolar geometry; the epipolar line is depicted in orange.

2.5 From [1], the top row is different camera frames overlaid with the estimated semi-dense inverse depth map. The bottom row contains the camera view as a pyramid with blue edges, with its associated trajectory as a line in front of a 3D view of the tracked scene.

2.6 Concept heterogeneous architecture block diagram

2.7 FPGA architecture. Dedicated SRAM memories and DSP blocks with capable hardened multipliers have significantly improved the efficiency of frequently used and traditionally costly operations.

2.8 A modern FPGA fabric is organised in clock domains made up of slices, with the edges of the silicon usually housing communication circuits and ports.

2.9 Zynq 7-series FPGA-SoC and Interconnect. Only connections relevant to the architectures researched in this thesis are included for clarity of presentation. Source: Zynq-7000 TRM [2]

3.1 Control flow of the tracking task. The three main sub-functions are described inside the dashed lines. The rest of the computation and control described in the figure happens outside these functions.

3.2 SLAM algorithmic overview. The main tasks are inside the two dashed rectangles. The main data structures are indicated with orange coloured boxes, while the light blue boxes represent the main tasks necessary to perform tracking and mapping. With grey we indicate some of the background tasks involved in full SLAM.

3.3 Percentage of total computation for the tasks comprising SLAM

3.4 System Architecture

3.5 Accelerated tracking and mapping execution in software


3.6 Intensity value interpolation for a projected point using its 4 neighbouring pixels

3.7 Residual and Weight Calculation Unit

3.8 Tracking involves projecting a map point, with a recovered inverse depth 1/z, to the image plane in the current camera frame

3.9 Residual Calculation Pipeline - Pixel Re-projection

3.10 Pixel Gradient and Intensity Interpolation

3.11 Element Interpolation

3.12 Different scenes will generate Keyframes with fewer or more mapped points, which affects the runtime both in Hardware and in Software

3.13 Average performance in frames per second for two different sequences selected from the datasets in Table 3.4. The grey colour corresponds to a more dense, complex scene where a higher number of points are mapped and used for tracking. Examples of two scenes that would generate maps with a different number of points are in Fig. 3.12.

3.14 Memory transfer cost in microseconds to synchronise the map and frame with the hardware buffers on the DRAM. This happens once per pyramid level.

3.15 Software and hardware timing for individual functions. These are run multiple times until the error stops decreasing significantly, at which point the process is repeated for coarse-to-fine pyramid levels until the penultimate level (once subsampled from the original image) is reached.

3.16 In a moving platform or camera, the performance of SLAM will directly affect the accuracy of tracking and the accuracy and quality of the reconstruction, and in extreme cases lead to tracking loss. The map on the right is on a system that can deliver a 4× improvement in performance compared to the left. The map on the left has accumulated a very large error from skipping frames due to performance constraints.


3.17 Even if the slower system can keep up, improved performance is crucial to improve quality and accuracy under fast movement. The map on the left was recovered on a system with a performance deficit of 2× compared to the one on the right.

4.1 System Architecture

4.2 Tracking Core Architecture

4.3 Frame Cache Architecture

4.4 Frame Cache Architecture

4.5 Processing Time Per Level

4.6 Total Time per Level including Memory Copy

4.7 Performance comparison of this accelerator with our previous work presented in Chapter 3 and with the NEON-accelerated software on the ARM Cortex-A9 as baseline.

4.8 Performance/Resource Scaling targeting a larger FPGA

5.1 Zynq 7-series FPGA-SoC and Interconnect. Only connections relevant to the architectures researched in this thesis are included for clarity of presentation.

5.2 Block diagram of the accelerator architecture

5.3 Sliding window over current keyframe

5.4 Sliding window utilizing shift registers and 4 row buffers

5.5 The intensity gradient is calculated in the two image directions, for the target pixel and its four immediate neighbours, resulting in a total of 13 accesses, 20 gradient calculations and a final reduce operation to find the maximum value in the region.


5.6 Epipolar geometry; the epipolar line is depicted in orange. While the point will lie on the line, it does not have to appear in the camera’s frame, as it can lie outside that plane.

5.7 For each comparison, the previous four values are re-used and a new one is calculated by interpolating the values of the four pixels surrounding the floating-point coordinates of the next scan point.

5.8 In this case, intensity information, combined with a large baseline in absence of a strong previous estimate, is insufficient to provide a good match.

5.9 The units in the fast-rate pipeline operate at two rates simultaneously, with some control processes, initialization and communication with the rest of the pipeline happening only when starting or finishing a scan, and the main compute units operating at a rate of one scan step per cycle.

5.10 Heatmap of Depth Map valid points for epipolar scan. Axes represent image coordinates, with colour representing the frequency of a Keypoint in those coordinates requiring a scan.

5.11 Resource scaling with architectural tuning targeting 100 MHz

5.12 Mapping performance in milliseconds - Different Platforms / Datasets

5.13 Power consumption of the devices tested. Here, “This work” refers to the combined power of the tracking and mapping accelerators and the CPU operating for the background tasks of SLAM.


Chapter 1

Introduction

1.1 Machine Vision

Building machines with the ability to see and understand the space around them has interested

researchers for decades. At the same time, it is a complex task that, while appearing effortless

for humans, has proven very hard to tackle. In recent years, there has been an explosion

of interest and research effort surrounding intelligent machines and systems. One area of

particular interest is the push towards fully autonomous machines that can move and interact

in an unknown environment. A self-driving car, for example, needs continuous awareness of

the environment around it, both static (road surface, lane markings) and dynamic (e.g. other

vehicles) to be able to operate safely and autonomously. Lately, there has been significant

research and progress on this front, utilizing advances in algorithms, sensors and computing

platforms to achieve important milestones towards a system that can continuously generate a

map of what is around it and track itself in that environment.

This type of system is a necessary component of many emerging applications. One example is

household assistance robots e.g. [3, 4] that could finally gain the capability to handle complex

objects that are articulated or deform and to navigate autonomously in a challenging and

dynamic three dimensional environment. A similar line of research is leading to environment-

aware industrial robots that can operate alongside human operators in a much safer way, even


in co-operative situations. Smaller flying machines are gaining the capability for autonomous

indoors and outdoors flight and exploration, both in quadcopter form [5] and as a fixed-wing

aircraft [6]. These can provide the capability for rapid exploration of large spaces using swarms

[7], much more effective search and rescue operations [8] and precision agriculture used in

conjunction with other vision techniques [9]. Lately, we have witnessed a very strong push

towards fully autonomous cars from both private companies and academic efforts [10]. Finally,

the field of autonomous robots is closely linked with augmented reality applications which have

to continuously track a user’s location, orientation and movement at a very low latency and

maintain an accurate map of the user’s surroundings.

One of the core elements in this effort is a family of algorithms and systems called Simultaneous

Localisation and Mapping (SLAM), which aims to provide a solution to the problem of exploring

an unknown environment while continuously keeping track of the system’s own position. SLAM

has to integrate a series of observations of an environment from different sensors, using its own

estimated positions, to incrementally build and maintain a map of that environment. The

term SLAM has been used in the past to describe many efforts, including slow and progressive structure-from-motion methods and 2D mapping utilizing sensors such as sonar and lidar. However, the forefront of this field, especially with regard to autonomous systems that have to operate in a complex, dynamic environment, concerns systems utilizing a visual sensor and operating on a moving platform. Thus, this thesis will concentrate on real-time

visual SLAM, which refers to performing all processing live, at the camera’s rate of operation

or framerate.

SLAM in the literature usually comprises two main tasks. Localisation, often referred to as

tracking, is the act of continuously estimating the 6-degree-of-freedom position and orientation

(pose), of the camera. Mapping is the task of generating and continuously updating a coherent

model of the environment based on the sensor observations. These two tasks are very closely

interconnected and strongly dependent on each other. Tracking closes the loop with the mapping

task by comparing the incoming data from the sensor with the map that has been generated

to estimate a current pose. Then, the accuracy of that estimation will determine the quality of

the next map update and how close it will be to reality.
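This tight coupling can be sketched as a minimal loop (an illustrative sketch only; the function names and the trivial "tracker" are hypothetical stand-ins, not the algorithms studied in this thesis):

```python
import numpy as np

def track(frame, world_map, pose_guess):
    """Estimate the camera pose by aligning the new frame against the
    current map. A real tracker iteratively minimises a reprojection or
    photometric error; this placeholder returns the guess unchanged."""
    return pose_guess

def update_map(frame, pose, world_map):
    """Fuse the new observation into the map using the tracked pose.
    The accuracy of `pose` directly bounds the quality of this update."""
    world_map.append((pose.copy(), frame))
    return world_map

# The core SLAM loop: tracking and mapping feed each other every frame.
pose = np.eye(4)                   # 6-DoF pose as a 4x4 rigid-body transform
world_map = []
for frame in ["f0", "f1", "f2"]:   # stand-in for a live camera stream
    pose = track(frame, world_map, pose)
    world_map = update_map(frame, pose, world_map)
```

The structure makes the interdependence explicit: each map update consumes the latest pose estimate, and each pose estimate is computed against the map built so far.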


Figure 1.1: The SLAM continuum from sparse (mobile CPU) through semi-dense to dense (high-end desktop GPU acceleration). A more complex and globally consistent map significantly increases computation requirements. Sources: ORB-SLAM (R. Mur-Artal), LSD-SLAM (J. Engel et al.), ElasticFusion (T. Whelan et al.)

Towards addressing the challenges of real-time visual SLAM, a number of solutions have

emerged and the field has gradually generated different approaches, each with their own ad-

vantages and disadvantages. A main categorisation is in terms of map density, describing how

many observations the algorithm deals with. The categories that emerge from this are Sparse,

Dense and lately Semi-dense SLAM. Sparse SLAM uses a small set of observations for Track-

ing and maintains a sparse map of the environment consisting of some observed 3D points of

interest. These approaches exhibit lower computational requirements for a similar accuracy

in estimating the position of the camera in the environment, but are limited in the density and “richness” of the environment’s reconstruction. At the other end of the spectrum, SLAM

algorithms categorised as Dense are now able to construct a complete high quality model of the

environment usually as interconnected surfaces. At the same time they are significantly more

computationally intensive.

To address this drawback a family of works described as Semi-dense SLAM have emerged.

These aim to provide a more dense and information-rich representation compared to sparse

methods, while achieving better computational efficiency through processing a subset of high

quality observations, from which they attempt to reconstruct as complete a model of the environment as possible. However, they are still computationally complex and target desktop-grade

multicore CPUs for real-time processing. In Fig. 1.1 we can see three state-of-the-art examples

positioned in this continuum of sparse to dense.


4 Chapter 1. Introduction

At the same time, combined with the high computational requirements and the low latency

necessary for emerging applications, many platforms come with constraints on the power and

weight they can support. Quadcopters and other UAVs and ground robots impose significant

constraints on both power and weight while other applications that require some form of SLAM

such as augmented reality face even stricter constraints. This has resulted in a large gap today between research in SLAM algorithms and SLAM implementations on embedded platforms. A main cause is the wide disparity in computational capabilities between high-end GPUs and desktop CPUs on the one hand, and low-power mobile devices or embedded SoCs on the other.

Finally, it is important to note that in the state of the art many dense methods utilize specialised

sensors that can recover an estimate of the depth directly from the visible scenes. The main

example is Kinect-type sensors, projecting an infrared pattern that, together with a special

infrared camera and ASIC combination, give a depth estimate for all objects up to a few

meters away from the camera. This type of sensor, which we will discuss further in Chapter 2,

has enabled very high quality results, but requires higher power consumption, is heavier, and is constrained to depths of a few meters, making it unsuitable for outdoor spaces or large

environments. For this work, low power and weight characteristics are an essential target, to

enable the emerging applications discussed. Additionally, many robotic platforms have to be

able to operate in large or outdoor spaces. Thus, we will focus on works that utilize passive,

monocular cameras, since they are a power efficient and lightweight solution, and can work well

in a variety of environments.

1.2 Terminology and concepts in the field of SLAM

This section will outline and establish the main concepts and terminology that are used through-

out the literature of SLAM and this thesis.


1.2.1 Input data

First we shall define the form of data that a SLAM algorithm deals with, when applied in the

context of a SLAM system (e.g. a robot with a set of sensors for localisation and mapping). The

input of a SLAM algorithm is a series of measurements of the environment in which it is applied,

optionally combined with a secondary localisation system using a series of measurements of the

system’s position, rotation or acceleration. The precise form of environment measurements will

vary for different solutions depending on the sensors used, but it will be one or a combination

of the following:

– A series of images from a camera capturing light intensity and optionally including colour

information. When referring to images captured from a camera in the context of SLAM

as part of a sequence, the word Frame is frequently used to describe one of the captured

images in the sequence.

– A series of depth measurements in an area around the system using depth-sensing

sensors such as a sonar or time-of-flight sensor.

– A series of images (or frames) captured from an RGB-D sensor combining both of the

above in an integrated image with depth estimates for all or part of the field of view.

For a SLAM algorithm the above input is used as a sequence of observations. Focusing on

a sequence of images captured from a camera, which is the main input in most of the works

discussed in this thesis, each image is characterised by the following:

– Pixels. Each pixel (originally picture element) is the smallest addressable element of a

digital image sensor, providing a light intensity I (and optionally colour) measurement

for a part of the field of view. There are hundreds of thousands to millions of pixels

composing an image from a modern digital image sensor.

– Height and width, specified as the number of pixels in each dimension.


– Resolution, defined as the total number of pixels per image and obtained from the

product of height and width.

When referring to image pixels in a captured image we shall use their position in terms of

vertical and horizontal dimensions, ranging from (0, 0) at the upper left corner of the sensor

array to (width − 1, height − 1) at the bottom right. Zero-based numbering such as this is

frequently used, since it directly corresponds to an offset from an initial pixel position. As such,

it matches the forms of addressing usually employed in such systems.
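As a concrete illustration of this addressing scheme (a small hypothetical helper, not code from any particular system), the pixel at coordinates (x, y) of an image stored row by row in a flat buffer sits at offset y · width + x:

```python
def pixel_offset(x, y, width, height):
    """Linear buffer offset of pixel (x, y) under zero-based, row-major storage.

    (0, 0) is the upper-left corner of the sensor array and
    (width - 1, height - 1) the bottom-right.
    """
    assert 0 <= x < width and 0 <= y < height
    return y * width + x

# A 640x480 frame has a resolution of 640 * 480 = 307200 pixels; the first
# and last pixels map to the first and last offsets of the buffer.
assert pixel_offset(0, 0, 640, 480) == 0
assert pixel_offset(639, 479, 640, 480) == 640 * 480 - 1
```

This direct correspondence between coordinates and memory offsets is why zero-based numbering is the natural choice in such systems.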

1.2.2 Output

For an input of a series of frames, the output of a SLAM algorithm consists of a generated

map of the environment around the system and an estimate of the sensor’s (and by extension

the system’s) pose.

The recovered pose produced by the SLAM system is an estimate of the position and orientation

of the system, at each captured frame, in relation to the origin point in a process known as

tracking. To recover the pose, the information in the input (a captured frame) needs to be

compared to and matched with known points of the environment and optimised against them.

These known points of reference are part of the generated map of the system.

A map for the SLAM system is a representation of a generated model of the environment.

It differs between SLAM systems, as we will discuss further in the following two chapters, but for most algorithms it is composed of a set of points, where a point is an observed part of the

environment in terms of a single pixel or a set of neighbouring pixels. The actual size of a

“point” in the real world is variable and depends on camera parameters such as focal length

and the distance from the object observed.

A point is composed of its position coordinates in three-dimensional space (x, y, z) in relation

to an origin point (0, 0, 0) and other identifying information. This information may be just the

light intensity and colour of that part of the environment, but could include other descriptors.


For example, in state-of-the-art feature-based SLAM, a pixel’s intensity is compared with its

neighbours in a small area. The results of those comparisons are stored as a set of binary digits

to describe that local pattern. The origin point is conventionally chosen to be the position of

the system at the time of the first captured frame.
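Such a comparison-based descriptor can be sketched as follows (a simplified, census-style illustration assuming a 3×3 neighbourhood; real feature-based systems use more elaborate sampling patterns):

```python
def binary_descriptor(image, x, y):
    """Describe the 3x3 neighbourhood of pixel (x, y) as 8 bits.

    Bit i is set when the i-th neighbour is brighter than the centre pixel.
    `image` is a list of rows of intensities; borders are not handled here.
    """
    centre = image[y][x]
    neighbours = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]
    bits = 0
    for i, (dx, dy) in enumerate(neighbours):
        if image[y + dy][x + dx] > centre:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")
```

Matching the same point across different views then reduces to comparing such bit patterns, e.g. by Hamming distance, which is far cheaper than comparing raw image patches.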

The map is generated by extracting information from the sequence of captured images. Algo-

rithms that attempt to generate a more complete reconstruction of the environment use surface

elements (surfels), volume elements (voxels) or other forms of interconnected surfaces. These

different representations are an active research area in an effort to generate an efficient but

complete model of the environment around a system.

In SLAM algorithms, the more complete the reconstruction, the less information is stored per element. SLAM algorithms that require a few hundred or a few thousand points in three-dimensional

space to compose a map, frequently employ complex descriptors for a small area around each

point to guarantee accuracy when matching the same point across different views. On the other

hand, SLAM algorithms that attempt a full surface reconstruction rely exclusively on the light

intensity or colour of the recovered surfaces and track against most of the visible portions of

the map at the same time.

1.2.3 SLAM operation and quality metrics

In a typical SLAM system processing always begins from the flow of information at the input.

The system captures a frame from the camera and begins matching and comparing the captured

information with the model of the environment that has been generated (the map). This

matching and comparison forms part of an optimisation process towards estimating the pose

of the camera that minimizes the error between the model and the actual observations.

This pose estimate at the end of the optimisation process is then used as the input of a mapping

process that attempts to integrate the newly captured information to improve the accuracy of

the current map and add new pieces of information to it. As such, the main quality metrics used for SLAM concern either accuracy or performance. Regarding accuracy, these


metrics concern either the accuracy of the successive pose estimates (one for each captured frame) or the accuracy of the generated map of the environment.

The accuracy of the pose is quantified as the distance (error) between the real-world three-

dimensional position of the sensor in relation to the origin point and the estimated position of the

recovered pose. To capture the behaviour of the algorithm across a dataset, one of the most

frequently used quality metrics is the root mean squared error of the poses that comprise the

entire captured trajectory. It gives a good estimate of the average behaviour of an algorithm

while also penalising large errors even if they are encountered on fewer observations. Other

metrics can be used around the accuracy of the pose estimates as well, such as the rotational

error (error in the orientation estimates of the camera in terms of degrees) or loop closure

errors (the distance between the estimate and the real-world position when returning to the

same world point).
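For instance, the trajectory RMSE can be computed as below (a minimal sketch assuming the estimated and ground-truth positions are already expressed in a common frame and associated one-to-one; published evaluations also perform an alignment step first):

```python
from math import sqrt

def trajectory_rmse(estimated, ground_truth):
    """Root mean squared translational error over an entire trajectory.

    Both arguments are equal-length sequences of (x, y, z) positions,
    assumed already aligned and associated frame-by-frame.
    """
    assert len(estimated) == len(ground_truth) and estimated
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a single large error is penalised more heavily than several small errors of the same total magnitude, which is exactly the property described above.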

As opposed to tracking accuracy, the quality of reconstruction has been less frequently used in

published literature to compare competing algorithms. However, this is gradually starting to

change in recent works. It is usually discussed qualitatively, as opposed to using a quantitative

metric, even in state of the art methods, in terms of the completeness of the map and the

amount of visible distortion in comparison to the actual environment captured.

The performance metrics for a SLAM algorithm have to do with latency and throughput.

Latency, defined as the time between the arrival of new input data and output data being

ready, is important in fast and agile platforms such as quadcopters or other autonomous aircraft

where the pose and map may be used online for tasks such as obstacle avoidance. Relevant

latency measurements are:

– From the moment of capturing a frame until the estimated pose for that frame is recovered.

– From the moment of capturing a frame until the information in that frame and the

recovered pose estimate have been incorporated into the map.

The choice of which one to use depends on how the SLAM output is used in the context of a

latency critical application.


Throughput, in terms of frames per second processed, is more frequently used as it will

directly dictate the speed of robust operation and indirectly affect the quality of both pose

estimation (tracking) and mapping. This has mainly to do with the fact that SLAM operates

in unknown or partially recovered environments and requires a relatively small amount of

rotation and translation between successive frames for robust operation. A high throughput

will satisfy this requirement even for faster and more agile moving platforms. Throughput is

sometimes directly referred to as framerate since it is most often measured in terms of frames

processed per second.

1.2.4 Independent variables for SLAM and their effect

Depending on the specific algorithm implemented, there is a significant number of variables

that can be tuned to affect the quality and performance of that algorithm in different situations.

This subsection will summarise and describe the most common and important ones.

Variables relating to Map density

Many SLAM algorithms, including semi-dense SLAM, will not attempt to reconstruct 100% of

their environment but will instead recover portions of the environment in terms of a collection

of “mapped points”. In these algorithms there are various thresholds that affect the number

of points that the algorithm attempts to recover as part of the map. This also affects the

complexity of the tracking task, as a larger number of recovered map points means a larger

amount of points to match and compare for pose estimation.

Increasing the map density can increase the accuracy of tracking and mapping up to a certain

level, but then a point of diminishing returns is reached. It can also enhance the quality of the

map, and in some cases a comparatively high density is required for applications in which a

higher degree of environment awareness is important. However, it will also increase the latency

of the tracking and mapping tasks as the amount of data to process in both tasks increases.


Image resolution

The resolution of the captured images from the camera sensor will dictate the smallest ob-

servable part of the environment, and the amount of observations in each frame. As such, a

higher resolution stands to improve the algorithm’s accuracy (again until a point of diminish-

ing returns) but will increase the number of elements that need to be processed and directly

increase the latency of the tracking and mapping tasks. It should be noted that the resolution used for the tracking task and the one used for the mapping task need not be equal, since the image can be downscaled independently for either task after being captured at the camera.
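One simple way to produce such a lower-resolution frame (an illustrative sketch; real pipelines typically build a multi-level image pyramid with filtering) is to average each 2×2 block of intensities, halving both dimensions:

```python
def downscale_half(image):
    """Halve the resolution of `image` by averaging each 2x2 intensity block.

    `image` is a list of equal-length rows with even height and width;
    applying this repeatedly yields coarser levels that one task can run
    on while the other uses the full resolution.
    """
    height, width = len(image), len(image[0])
    return [
        [(image[y][x] + image[y][x + 1]
          + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
         for x in range(0, width, 2)]
        for y in range(0, height, 2)
    ]
```

Each halving divides the number of pixels to process by four, which is why downscaling is such an effective lever on tracking and mapping latency.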

Variables relating to noise modelling

SLAM algorithms are probabilistic, dealing with imprecise sensor measurements and generated estimates. As such, they attempt to correlate different observations to different degrees, and most SLAM algorithms model sensor and estimate noise to improve their quality. Depending on the algorithm and its implementation, there will be different

independent variables relating to noise modelling that can be tuned to improve the quality of

the output for different environments and sensors used.

1.3 Motivation

This section will discuss the motivation for the work presented in this thesis. Due to power con-

straints, most embedded visual SLAM implementations focus on sparse SLAM that is adapted

towards reducing computational requirements. One main approach is to trade information

richness and robustness for performance by using a lower sensor resolution, tracking a sparser set of points instead of surfaces or objects, and avoiding denser or large-scale mapping and map consistency entirely. These solutions are either designed from the ground up to be computationally efficient, for example [11] and [12], or are reduced versions of existing approaches.

Another approach to embedded sparse SLAM has been to design a lightweight but accurate

visual odometry algorithm that can achieve real-time performance on-board an embedded de-


vice, with the option of offloading computation to a remote server for reconstructing a dense

map [13, 14]. This comes with increased power consumption for the wireless communications, as well as increased latency. It also comes with a reduced area of operation and very

high bandwidth requirements, and is not a good fit for applications requiring awareness of a

dense and complex 3D scene.

Moreover, this trend towards more accurate and advanced but computationally demanding

SLAM algorithms seems to be continuing at a faster pace than advances in computing platforms’

raw performance. The examples of emerging applications in the embedded space introduced

previously require the level of quality only state-of-the-art, large-scale dense or semi-dense

SLAM can provide. These systems require a high level of understanding of their environment

that sparse SLAM inherently cannot provide.

At the same time, due to safety and robustness requirements, there is a need for very low

processing latency in many platforms. In most state-of-the-art SLAM algorithms there is

an underlying assumption that there is a small amount of translation and rotation between

successive processed frames. This means that depending on the movement speed of the camera,

a minimum level of performance for a certain platform is necessary to avoid very large errors

and potential failure in tracking due to fast movement. However, further improvements past

that minimum are important as they can provide a larger improvement in quality and accuracy

for the same algorithm. Figures 1.2 and 1.3, discussed further in Chapter 3, demonstrate the

impact of reduced performance during live SLAM, causing frames to be dropped instead of

processed from the camera. In Fig. 1.3, a 2× reduction in performance on the left side of the figure, compared to the system on the right, creates a large accumulated trajectory error and a visible drop in quality and detail in the recovered map surfaces.

Meanwhile, in addition to the performance requirements and complexity of SLAM, most embed-

ded robotics and augmented reality applications have significant power and weight constraints.

These specifications rule out most of the conventional hardware that can perform cutting-edge

SLAM in real time in the embedded space. This level of computational demand, which cannot be met by software optimisation alone, can be met by new hardware platforms. Towards closing this gap


Figure 1.2: In a moving platform or camera, the performance of SLAM will directly affect the accuracy of tracking and the accuracy and quality of the reconstruction, and in extreme cases lead to tracking loss. Left panel: skipping frames due to performance constraints; right panel: adequate performance for the movement speed. The map on the right is from a system that delivers a 4× improvement in performance compared to the left; the map on the left has accumulated a very large error from skipping frames due to performance constraints.

Figure 1.3: Even if the slower system can keep up, improved performance is crucial to improve quality and accuracy under fast movement: higher performance leads to more information processed, and reduced drift and error accumulation for pose and map. The map on the left was recovered on a system with a performance deficit of 2× compared to the one on the right.


between performance and efficiency for embedded SLAM, a custom domain specific hardware

architecture can provide a significant improvement both in terms of absolute performance, and

performance-per-watt compared to conventional general purpose processors. General purpose

hardware has to be fast and economical for the average case of software, and support all com-

binations of an instruction set. Modern high-performance CPUs can take advantage of some

instruction level parallelism, being able to concurrently translate and execute several arithmetic

instructions in parallel (referred to as how wide a CPU’s microarchitecture is), and different

parallel tasks can execute on multiple cores. However, their flexibility in supporting any software task means they will necessarily have a lower efficiency for specialised workloads than a custom solution.

Meanwhile, GPUs have seen significantly increased use as accelerators, often referred to as single

instruction, multiple threads (SIMT) processors. At the time of writing, standard software

languages such as C++ can be used to compile and execute code on the processing elements

in a GPU in a relatively straightforward fashion. They offer much more parallelism than

CPUs with their simpler but very wide vector instruction units, but they are still throughput

optimised many-core architectures targeting data-level parallelism. Their efficiency is thus

limited to certain types of algorithms that fit this model well and they usually have a much

higher latency to process a single thread than a traditional CPU.

Algorithms such as direct tracking for SLAM do not fall neatly in either case. They do come with

some instruction-level parallelism but are still computationally intensive for embedded CPUs,

and do not fit the data-level parallelism required for mapping onto a GPU or multiple cores, due to data dependencies and non-homogeneous computation between iterations with complex control flow. Moreover, their data-intensive nature would stress either of these platforms, both

in terms of data movement and random access patterns. The advantage of custom hardware is

twofold. First, by precisely controlling resource allocation a design will only have the processing

elements necessary for the task at hand. This means better-utilized area, with fewer resources taking up space and power. Second, by designing and optimising the hardware architecture to

match the needs of a specific application its power efficiency and achievable performance can

be improved by a large margin.


At the same time, hardware realised as a custom integrated circuit, referred to as Application

Specific Integrated Circuit (ASIC), has a high initial cost of design and production. As such,

ASICs are usually produced when that cost is amortised by large production volumes, enabled

by the amount of re-usability and standardisation of an algorithm. Between these platforms

sits reconfigurable hardware, such as Field Programmable Gate Arrays (FPGAs).

FPGAs still allow most of the customisability of designing an ASIC, since they can emulate most

digital designs composed of digital gates and SRAM memory banks. At the same time, since

they are reconfigurable, they have a much smaller implementation cost and can be updated in the field: a change in the HDL (hardware description language) code describing the design, followed by a synthesis and place-and-route step, is a matter of hours, whereas issuing a new version of an ASIC chip costs almost as much as designing it from scratch. In this way, utilizing this platform trades a 7-20× reduction in performance and a 10× penalty in power and area against an optimised CMOS implementation [15, 16], but significantly reduces the cost and timeline of development. Thus, a custom hardware design can be realised

and deployed faster and at a lower cost.

Crucially, even with this disadvantage over ASICs they are still more power efficient and achieve

a higher performance per watt than general purpose architectures for certain algorithms. More-

over, SLAM is currently a volatile field, with the methods and algorithms changing significantly

every few years. In light of this, taking advantage of the lower time for redesign on FPGAs,

the advantages of specialised hardware can be realised now on off-the-shelf devices, and the

hardware can then be updated to keep up with the software better than a specialised ASIC

could, and at a lower cost. Then finally, if the algorithms become more standardized, a well-

defined generalisable architecture can carry over most of its principles and conclusions and be

re-implemented on an ASIC to maximise the gains of custom hardware.

To conclude, by using off-the-shelf reconfigurable logic to develop custom, specialised hardware

for SLAM, a significant performance-per-watt advantage can be gained compared to general

purpose hardware. This can enable the use of advanced state-of-the-art SLAM algorithms

on low-power embedded platforms, bridging the gap between the current state-of-the-art in


embedded vision and the latest trends in SLAM algorithms and providing a solution towards

achieving advanced capabilities for emerging applications. As we shall demonstrate in this

thesis, the accelerators that emerged from our research fully address the computationally demanding tasks in SLAM and offer real-time, high-performance SLAM (more than 60 frames per second) at a power level an order of magnitude lower than the general purpose hardware widely used for state-of-the-art SLAM algorithms.

1.4 Research Question

In recent years many emerging applications have appeared in the fields of autonomous robotics

and embedded systems that require advanced computer vision capabilities. The high per-

formance requirements of algorithms that can provide such capabilities, combined with the

resource constraints of most embedded platforms, led to the research question underlying this

thesis:

“Is it possible to design a custom hardware architecture for embedded platforms that can bring

advanced SLAM capabilities with high performance together with low-power characteristics and

how should such an architecture be designed?”

Towards answering this question, we have focused on the following criteria. Firstly, such an

architecture has to be applicable to state-of-the-art algorithms. SLAM is a field that is evolv-

ing fast, and focusing on older, simpler algorithms would diminish the impact of a custom

hardware solution. Moreover, it has to achieve a significant improvement in performance per

watt compared to current embedded platforms for robotics, while maintaining or improving

on their power requirements for continuous operation. Finally, to have a high impact in the

field of embedded SLAM, it has to maintain real-time operation, being able to process at least

30 frames per second even in complicated environments while, if possible, maintaining a low

latency from input to output.


1.5 Aims and Thesis Overview

In the work described in this thesis, we propose two specialised, domain-specific architectures

based on FPGA-SoCs to match the unique demands of these algorithms, and bring high per-

formance for state-of-the-art SLAM on an embedded power budget. The aim of this thesis is

to first present an overview of state-of-the-art visual SLAM with a focus on semi-dense methods, which offer a more information-rich output than traditional sparse approaches in robotics.

Subsequently we will discuss in depth the performance and computation patterns of a state-

of-the-art example, LSD-SLAM [17], and proceed to present our research in overcoming the

challenges described above with designs based on FPGA-SoCs. This will lead to two high

performance dataflow designs that accelerate both the tasks of tracking and mapping for semi-

dense direct SLAM. These achieve a high-end desktop CPU level of performance at an order

of magnitude lower power consumption. Moreover, they are designed to be scalable and easily extended or modified, and can be implemented on off-the-shelf FPGA-SoC chips.

Chapter 2 presents the necessary background and details for the algorithms discussed, utilized

and accelerated in this thesis, as well as the hardware platforms, tools and design principles

that are used as part of this work. Chapter 3 will present our profiling and characterisation

results for LSD-SLAM and a first straightforward deeply pipelined solution. Some first domain

specific optimisations are discussed, as well as the functionality of the tracking task in this

algorithm, and finally the resulting bottlenecks and challenges that were encountered and led

to the high-performance solutions of the next two chapters.

Chapter 4 will then discuss a novel, highly-optimised dataflow based approach that achieves

desktop-grade performance at an embedded grade power envelope for the task of tracking.

Taking advantage of the lessons learned in our previous research, the design presented in that

chapter focuses on providing an efficient, high-performance solution for Direct Tracking using

a high-bandwidth streaming architecture, optimised for maximum memory throughput.

Chapter 5 will present the design of a scalable and high performance, power efficient specialised

accelerator architecture targeting mapping in the context of a semi-dense SLAM algorithm.


Finally, Chapter 6 will discuss our overall conclusions and contributions, and our view on the

future of this work and the field in general.

1.6 Research Contributions and Statement of Originality

This thesis presents the following original contributions towards achieving high-performance, state-of-the-art SLAM on embedded low-power platforms. These contributions have affirmatively answered the question stated in Section 1.4, covering all of our initial requirements.

• The characterisation and analysis of a state-of-the-art semi-dense SLAM algorithm, LSD-

SLAM. The results obtained guided the research of the domain-specific accelerators pre-

sented in this thesis, and gave a detailed overview of the performance requirements and

characteristics of this type of algorithm.

• The first, to the best of our knowledge, design for a high performance, power efficient

accelerator architecture for direct photometric tracking for SLAM on an FPGA-SoC.

• An investigation of the factors that affect the performance of this type of accelerator, the peak performance that can be achieved by an off-the-shelf FPGA-SoC, and the bottlenecks that can arise when it is used as a coprocessor next to a mobile CPU for this type of application.

• The design of a scalable, high-performance and power-efficient specialised accelerator architecture that can process and update a map in less than 20 ms, the average latency between two camera frames in the context of semi-dense SLAM in state-of-the-art robotics applications.

• Lastly, a system-on-chip that can be combined with the work presented in Chapter 4

to form the first, to the best of this author’s knowledge, complete SLAM accelerator on

FPGAs targeting semi-dense SLAM, pushing the state of the art in performance and

quality for dense SLAM on low-power embedded devices.


These represent the author’s own work during his studentship at the Circuits and Systems

group of the Department of Electrical and Electronic Engineering, Imperial College London.

1.7 Publications

The original contributions of this thesis have been published in the following peer-reviewed

conference proceedings:

– ‘Semi-dense SLAM on an FPGA SoC’, Konstantinos Boikos and Christos-Savvas Bouganis, 26th International Conference on Field Programmable Logic and Applications (FPL 2016), IEEE, 2016 [18]

– ‘A high-performance system-on-chip architecture for direct tracking for SLAM’, Konstantinos Boikos and Christos-Savvas Bouganis, 27th International Conference on Field Programmable Logic and Applications (FPL 2017), IEEE, 2017 [19]

– ‘A Scalable FPGA-based architecture for depth estimation in SLAM’, Konstantinos Boikos

and Christos-Savvas Bouganis, 15th International Symposium on Applied Reconfigurable

Computing (ARC 2019) [20]


Chapter 2

Background

2.1 A brief history of SLAM

The roots of SLAM can be traced back to the beginning of the 20th century with the first

analytical methods to use multiple views (photographs) of the same scene to recover some of

its geometry. This itself can be attributed to the first discoveries of lenses and optics centuries

ago, which led to the development of projective geometry, followed many years later by the earliest efforts to define epipolar geometry.

This was formalised as the principles of photogrammetry, with one of the most important early

works published in 1899 and again in 1904 by Finsterwalder [21]. For these early works I am

going to refer to the translation published by Guillermo Gallego et al. [22] and the survey of

Sturm [23]. In that work, Finsterwalder demonstrates that from a set of uncalibrated images a

projective 3D reconstruction is possible and provides an algorithm utilizing seven points for the

case of two images. Kruppa published his work [24] in 1913, in which he first discusses the earlier findings of Finsterwalder’s paper “Die geometrischen Grundlagen der Photogrammetrie” (The Geometric Foundations of Photogrammetry), where two views with known inner orientation are sufficient to determine an “Object” up to scale. He then discusses a number of improvements, including a proof that with a calibrated set, 5 pairs of correspondences are

sufficient to provide a solution. According to Sturm [23], the concept of projective reconstruction, discussed in these earlier works, was re-discovered in computer vision in the early 1990s.

Indeed, in these early papers one will find a discussion quite similar to that in the opening chapters of a leading modern computer vision book published a century later, “Multiple View Geometry” by Hartley and Zisserman, albeit with slightly updated terminology.

In the late 1980s and 1990s a crucial improvement in the field came with the introduction of automatic feature detection to replace the manually hand-matched correspondences of the early works. These include, among others, the Harris and Stephens corner and edge detector [25], introduced in 1988, followed by various approaches such as the computationally efficient FAST [26]. Later feature detectors focused on boosting accuracy.

SIFT by David Lowe [27] aimed to add scale invariance and a descriptor that would allow accurate matching in a global space of points, with features now matched using the similarity of their descriptors as a score. SURF [28] attempted to improve both accuracy and performance and gained wide adoption in computer vision, but was still time-consuming to compute and compare. Later works introduced binary descriptors, such as ORB [29] and BRISK [30], that aimed to provide comparable or better accuracy overall while being much faster to compute. By avoiding costly floating-point operations, while taking advantage of their formulation to add capabilities such as rotation invariance, these led to much more efficient SLAM and computer vision algorithms.

In addition to the method used to track features and recover the pose, various methods are used to generate and maintain a map. Most early works used a version of Kalman filtering to optimise over a set of uncertain observations from uncertain positions, jointly optimising all parameters (e.g. Jin et al. [31], and later Davison et al. [32]) to recover a pose estimate and the map in one step. Meanwhile, the field saw the separation of tracking and mapping into separate threads, and the separation of their optimisation problems, first proposed by Klein et al. with Parallel Tracking and Mapping (PTAM), one version of which is published in [33]. Large-scale maps could now be optimised in terms of “Keyframes” as a background graph optimisation problem, which led to improved large-scale behaviour and performance, taking advantage of new multi-core hardware architectures.


Simultaneously, the first dense and direct methods were being developed, although they initially did not run in real time. Matthies et al. proposed in 1988 a method for the “Incremental Estimation of Dense Depth Maps from Image Sequences”. They described an algorithm which uses pixel-based information to estimate depth and depth uncertainty at each pixel and refine these estimates over time, utilizing a set of pictures with a known relative translation. They compare their method analytically and qualitatively with the point-based approaches of the time, with very encouraging results, and establish a rigorous theoretical base. They were then followed by the work of Hanna, who achieved “an estimation of both ego-motion and structure-from-motion” directly from a sequence of images by minimizing photometric error, essentially building the first example of direct SLAM. Despite these advances, such formulations were only considered for real-time SLAM two decades later, when advances in hardware allowed them to be explored.

In the following section we will discuss the current solutions, and the principles of operation of

SLAM at the time of writing. The focus will be on the current state of the field of SLAM, the

kinds of sensors and algorithms used, as well as an overview of their respective advantages and

disadvantages for embedded operation in light of emerging applications.

2.2 Principles of state of the art SLAM

In the field of SLAM, different sensors have been used [34, 35], including Lidar, sonar and, recently, RGB-D cameras (e.g. [36]). Lidar and sonar deal with a map usually limited to two dimensions around the sensor, and were used in early works for their simplicity and effectiveness. They are also used as complementary sensors, combined with visual sensors through sensor-fusion algorithms, for applications which demand a high level of robustness and accuracy, such as self-driving cars. However, they are heavier than a monocular camera, consume more power and are mostly constrained to two dimensions, making them unsuitable for many applications, especially as the sole sensor.

Active camera sensors in the RGB-D category combine a camera sensor with a technology


utilizing projected light to directly measure distance from the camera. Most recent approaches recover depth either by projecting a grid light pattern in infrared and capturing it with an infrared camera, or by the time-of-flight principle, where a burst of infrared light is emitted and the time of its reflection from a surface is used to estimate depth. Both utilize proprietary hardware and algorithms to send post-processed image and depth information directly to the platform they are connected to. These have enabled high-quality dense 3D reconstruction in indoor spaces [37, 38] but, because of their design principles, are constrained in their area of operation to indoor spaces of a few meters. They are also more expensive and power-hungry than a simple visual sensor, making them less attractive for embedded low-power robotics. As such, the work presented in this thesis focuses on enabling high-quality embedded SLAM using purely visual information from monocular cameras, as they work equally well indoors and outdoors and in larger spaces. Especially considering recently developed computer vision cameras, which combine a global shutter and lower noise levels with the traditional advantages of lighter weight and lower power requirements compared to other sensors, a passive visual sensor becomes a very good fit for lightweight, power-constrained platforms such as the ones we are targeting.

In the domain of visual SLAM, and especially direct, semi-dense SLAM, the rate of processing

needs to be fast enough so that successive frames have a relatively small amount of translation

between them [1]. The desired rate of update varies for different applications. Focusing on

autonomous robotics, which is one of the central motivations for this field and our work, research

has shown that effective localisation needs a performance of at least 30 frames per second for

most moving robotic platforms. Moving to faster platforms, such as self-driving cars and quadcopters, 50 to 60 frames/s, or in some cases even higher performance, can be necessary for SLAM not to fail and lose tracking under agile movement, as discussed in [39].

Faster moving platforms, such as fixed-wing aircraft, necessitate even faster sustained tracking

rates than that. In the work of A. Barry et al. [6] for example, stereo-based localisation and

mapping for obstacle avoidance on a fixed-wing drone updates the map with tracked points at a

rate of 120 frames per second. Augmented reality applications, meanwhile, require a latency of less than 16 milliseconds per frame, from camera to model update to visualisation, to provide


an acceptable experience. Even though the number most widely used to refer to performance

in this field is frames per second, or Hertz in terms of update frequency, the important figure

is the combined latency of the localisation and mapping tasks, since the processing happens

in real time. While the latency of camera tracking sets the minimum time between two camera frames being tracked, in reality the combined latency of tracking and mapping is the crucial figure, as it dictates the moment in time at which the next camera frame can be processed using an updated model of the environment, so as to utilize the most information and provide the best result.
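To make this concrete, the relationship between per-task latency and sustainable frame rate can be sketched with a few lines of arithmetic; the latency figures below are hypothetical, chosen only to illustrate the point:

```python
# Illustrative latency-budget arithmetic for real-time SLAM. The numbers
# below are assumptions for the sake of the example, not measurements
# from any cited system.

def max_rate_hz(latency_ms: float) -> float:
    """Frames per second sustainable at a given per-frame latency."""
    return 1000.0 / latency_ms

tracking_ms = 10.0  # assumed per-frame tracking latency
mapping_ms = 23.3   # assumed per-frame map-update latency

# Tracking alone would sustain a 100 Hz camera, but a frame that must be
# tracked against a freshly *updated* map is bounded by the combined
# latency of tracking and mapping.
print(max_rate_hz(tracking_ms))                # 100.0
print(max_rate_hz(tracking_ms + mapping_ms))   # ~30.0
```

At these assumed figures, tracking alone could keep up with a 100 Hz camera, yet frames tracked against an up-to-date model are limited to roughly 30 Hz, which is why the combined figure is the one that matters.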

SLAM is a complex probabilistic problem where visual information consisting of hundreds of

thousands of pixels needs to be processed for every frame. The first solutions focused on

tracking a small set of a few hundred to a thousand points; however, state-of-the-art methods

are moving to complete surface reconstruction and dense point clouds where most of the visible

points need to be processed for every frame. The camera resolutions used are normally in the

region of 640x480, sometimes going up to 960x720. It was found that increasing the resolution

up to that level provides a small benefit to feature-based algorithms, and has a smaller positive

effect on direct algorithms [40]¹, while the runtime usually increases at least linearly with the

number of pixels. All this results in high computational requirements. State-of-the-art SLAM needs at least a high-end multicore CPU to run feature-based and semi-dense methods in real time [17, 41], and often high-performance GPUs for acceleration, as for example in [42] and [36], targeting a processing speed of at least 20-30 frames per second. McCormac et al.

in 2018 [38] showed impressive results in environment reconstruction and semantics but they

report a performance level of 4 to 8 frames per second for “unoptimised code” running on a

combination of high-end graphics card and multi-core CPU.

One other distinction in recent works is the actual method of tracking, and specifically the level

of visual feature at which SLAM operates. Within the continuum of sparse to dense SLAM

(which can be used to describe both the map density, as well as how much of it is used for the

task of tracking), SLAM is also categorised as direct and indirect (or feature-based). Indirect

¹The effect varies on a case-by-case basis for different scenes. In the cited paper the accumulated error over many runs in different datasets is used to demonstrate the effect.


SLAM first scans incoming camera frames to generate a list of features and their descriptors.

In a second step these are then matched with past observations to generate a list of matches,

including filtering steps to remove outliers. Then the optimisation step is performed on that

set of matches, optimising the geometry of the points in a joint-optimisation with the camera

position. Finally, a consistent probabilistic framework, proposed in works such as MonoSLAM

by Davison et al. [32] can significantly reduce error accumulation and drift in the trajectory

estimation and improve robustness of the algorithm. This probabilistic tracking of high-quality

features continues to give promising results in camera tracking. The main drawbacks of these

methods are the expensive step of extracting high quality features and matching them across

frames, the computational complexity of the probabilistic fusion of this information and finally

the fact that the local map they produce is inherently a sparse selection of points, which requires further processing to result in a dense reconstruction of the environment.

On the other hand, Direct SLAM uses the pixel intensity information in the camera frame to

optimise the pose, based on the photometric error of aligning the images from two poses, and a

photometric gradient to guide the optimisation. Such an approach was first proposed in the 1990s, but the computational capability to run it in real time was lacking until recently. Current state-of-

the-art works include Stuhmer et al. [43] and DTAM by Newcombe et al. [42]. The idea is to

directly utilize all the pixels visible in the camera and attempt to track a dense model of the

environment between frames in the form of surfaces. This is computationally complex; GPU

acceleration was leveraged in Newcombe’s work to achieve real-time performance defined by the

authors as “in the region of 30 frames per second”. According to the authors, this work showed

that “... a dense model permits superior tracking performance under rapid motion compared

to a state of the art method using features” when compared with the feature-based methods

available at the time. The authors also demonstrated the usefulness of the dense model for

interacting with the environment with a physics-enhanced augmented reality application. This

work inspired later state-of-the-art dense methods mentioned here in this thesis such as [36].
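As a concrete illustration of the objective minimised by direct methods, the sketch below sums squared intensity differences over a set of pixel correspondences. The function and its inputs are simplified stand-ins (a real system warps pixels through the estimated pose and depth, interpolates at subpixel positions and applies robust weighting inside an iterative optimiser), not the formulation of any cited work:

```python
# Minimal sketch of a photometric error: the sum of squared intensity
# differences between a reference image and the current image over
# corresponding pixels (illustrative only).

def photometric_error(ref_img, cur_img, correspondences):
    """correspondences holds ((u_ref, v_ref), (u_cur, v_cur)) pairs, i.e.
    reference pixels projected into the current frame under a candidate pose."""
    error = 0.0
    for (ur, vr), (uc, vc) in correspondences:
        residual = ref_img[vr][ur] - cur_img[vc][uc]
        error += residual * residual
    return error

ref = [[10, 20], [30, 40]]                      # toy 2x2 intensity images
cur = [[12, 20], [30, 37]]
pairs = [((0, 0), (0, 0)), ((1, 1), (1, 1))]    # assumed correspondences
print(photometric_error(ref, cur, pairs))       # 13.0 = (-2)^2 + 3^2
```

An optimiser would evaluate this error for candidate poses and follow the photometric gradient towards the pose that minimises it.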

One main issue with discussing and comparing work in the field is that many performance statements are written in the language quoted above. In this thesis, the performance and complexity of a solution is described in as accurate terms as possible, but


in cases where accurate statements are not included in the published works we can only offer

educated guesses as to what the authors are referring to.

The idea of directly using photometric information for tracking and mapping was developed in a different direction by Engel et al. in their work on LSD-SLAM [17] and Direct Sparse Odometry (DSO) [44]. In LSD-SLAM, they showed that one can still create a model of the environment directly from photometric information to achieve robustness and accuracy, as well as a denser map. However, by ignoring the parts of the frame that are poor in information, such as empty space and flat textureless surfaces, the runtime is reduced significantly, allowing the algorithm to be more lightweight and run in real time on a multicore CPU. It

is also important to note that such surfaces cannot have depth recovered by matching visual

information anyway, since there are no distinctive patterns to match, and are one of the known

failure modes of visual disparity estimation. This was demonstrated to run with a performance

of 30 frames per second for tracking and 15 frames per second for mapping on a laptop CPU

in the first formulation of the algorithm in [1] and is mentioned as running “in real-time on a

CPU” in [17].

Again, the problem with such statements from the authors is that they are vague and do not offer a good estimate of how powerful the machine actually is. However, one of the versions of that paper gives a hint that an “Intel i7 quad-core CPU” was included in the laptop. Given the time of publication, that refers to an Ivy Bridge or Haswell architecture Intel CPU, and the only quad-core laptop models were 45-47 W TDP chips, offering 4-6 megabytes of cache and, depending on the scenario, approximately 20-40% lower performance than their desktop counterparts.

In addition, regarding the image resolutions processed, this work, like most of the works we have cited, uses VGA resolution (640x480 pixels) for its quoted performance figures; unless expressly stated otherwise, the reader should assume this resolution for the works that follow.

Engel et al. took a different approach with their following work, DSO [44]. By constructing a

sparse sampling of points across the whole image, including low texture regions, and avoiding

a smoothness prior that makes a consistent probabilistic model infeasible to calculate in real

time, the authors developed a joint optimisation approach for all model parameters that runs


in real-time. In this approach, different modelled parameters including camera poses, camera

intrinsics (exposure, vignetting) and geometry parameters (inverse depth values) are jointly

optimised [44] in an effort to improve accuracy by explicitly modelling all known optical and

geometric variables. Optimisation is performed in a sliding window, where camera poses that

leave the field of view are marginalised, inspired by [45]. The information tracked is the pixel

intensity, and the map density can be tuned to trade off runtime with quality and resulting

density in different platforms.
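Since LSD-SLAM and DSO both store geometry as inverse depth values, a short sketch may help fix the idea; the camera intrinsics used here (fx, fy, cx, cy) are invented for illustration:

```python
# Sketch of the inverse-depth parameterisation used by direct methods such
# as LSD-SLAM and DSO: a point's depth z is stored as d = 1/z, which keeps
# distant points (z -> infinity, d -> 0) numerically well-behaved.

def backproject(u, v, inv_depth, fx, fy, cx, cy):
    """Recover the 3D point, in camera coordinates, for pixel (u, v)
    with inverse depth inv_depth, under a standard pinhole model."""
    z = 1.0 / inv_depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return (x, y, z)

# A pixel at the principal point with inverse depth 0.5 lies 2 units
# straight ahead of the camera.
print(backproject(320.0, 240.0, 0.5, 500.0, 500.0, 320.0, 240.0))
# -> (0.0, 0.0, 2.0)
```

Combined with a camera-to-world pose, such a backprojection is what lets a per-pixel inverse depth fully represent a 3D point in the world.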

In that work, they report performance superior to odometry methods using hand-crafted or learned features, adding merit to the idea of directly utilizing the visible information and an inverse depth formulation. Finally, the authors demonstrate the different sensitivity of feature-based and direct SLAM to different sources of error. According to [44], strong

geometric distortions caused by rolling shutter and imprecise lenses have favoured the indirect

approach, while newly developed cameras specialised for computer vision with a high framerate

and a global shutter can lead direct formulations such as DSO to achieve superior accuracy on

well-calibrated data since they directly model the photometric error.

A final important distinction is the difference between a full SLAM system and a visual odometry algorithm. Visual odometry focuses on maintaining an accurate position estimate and uses the simplest, most efficient form of map possible. Hence, it was used in early robotics to improve the state of the art towards navigating in unknown environments. On the other hand, moving from navigating to interacting autonomously, and building more capable, environment-aware systems, a more complete reconstruction and understanding of the environment becomes necessary. Full SLAM methods attempt to recover as much of their environment as possible, as well as keep a global, consistent map and enable loop closing. SLAM focuses on reconstruction accuracy and completeness as well as accurate odometry, and the quality of a SLAM algorithm has to be judged on both tasks at the same time: localisation and reconstruction accuracy, as well as large-scale consistency and map management.

The literature has only recently started discussing the quality of reconstruction and the ability to efficiently revisit places and handle global maps, and with the exception of some


notable works such as the benchmark from Handa et al. [46], McCormac et al. [38] and Whelan

et al. [36] most works only report metrics on trajectory errors to evaluate SLAM. In practice,

map quality is a crucial aspect for many applications and it significantly affects computational

requirements since it impacts the amount of information that is encoded and needs to be

processed. The community of SLAM has initiated discussions around this, and we expect this

to change in the future, but at the time of writing most published works discuss “tracking accuracy” or “tracking performance” as a quantitative metric, while mapping, reconstruction and other capabilities, even when discussed, are treated mostly on a qualitative level.

2.3 Direct semi-dense SLAM

One of the best performing and most well-known SLAM systems currently is LSD-SLAM,

which was published by Engel et al. in 2014 [17]. In comparison to other state-of-the-art

methods such as ORB-SLAM [41], it is the only one to achieve robust and efficient semi-

dense tracking, simultaneously with semi-dense map reconstruction. A high-density map is

an important feature for robotics, and skipping the step of feature extraction and matching means it operates well in a variety of environments and scales, without relying on specific types of features to work well.

A semi-dense SLAM algorithm generates efficient but high-density maps and can function

both indoors and outdoors relying on lightweight and power-efficient passive monocular camera

sensors. It stands to offer a good middle ground between dense methods requiring high-end

GPUs and custom RGB-D camera sensors and the lightweight point-based SLAM and visual

odometry algorithms that have so far been the state of the art for embedded systems. For these

reasons, and the accuracy of LSD-SLAM compared to other state-of-the-art methods, it was

selected as the best candidate to accelerate with the work presented in this thesis. As such, the

rest of the background will focus on semi-dense SLAM and the principles behind LSD-SLAM

to the extent that they are relevant to this thesis.

The rest of this section is based on the original work by Engel et al. [1, 17] and the discussion included in his PhD thesis. A good overview of the details relevant to accelerating and

implementing the discussed algorithms in hardware will be provided here, focusing on factors

that have an impact on custom hardware design and information that is relevant to understanding the functionality of these tasks. For a more detailed and complete discussion on these

algorithms and their software implementation the reader should refer to the cited work.

SLAM is usually discussed in terms of tracking and mapping; however, state-of-the-art methods can be composed of many more components. These include, among others, a loop-closing thread

scanning for candidates in older Keyframes when revisiting locations and a background task

performing graph optimisation whenever a loop closure is successful to minimise the overall

error of a trajectory. However, the core components remain the first two. Firstly, they are

the ones that are latency sensitive when discussing real-time SLAM. If loop closure is delayed,

the local accuracy of SLAM should not be strongly impacted, however a skipped frame during

tracking will adversely affect all components and in rapid movement may lead to tracking loss.

Moreover, if these two tasks function well they immediately lead to accurate visual odometry

and a local (and in the case of LSD-SLAM semi-dense) reconstruction of the environment, while

the resulting Keyframes can be stored for later processing to recover a global reconstruction

at a later time or offline. Finally, as will be discussed in profiling results presented in the next

chapter, they also occupy most of the computation time. These reasons drove the decision

to focus on these two tasks with regards to hardware acceleration and base the design on a

heterogeneous platform with a mobile CPU executing everything needed to support large-scale

SLAM as software and custom accelerators targeting tracking and mapping.

2.4 Algorithmic overview of LSD-SLAM

The two main tasks of SLAM, as we have seen and will discuss in detail in the following sections,

are tracking and mapping. The main inputs and outputs of these tasks will be discussed in this

section, as well as the main functions they are composed of. The following data structures are

used in LSD-SLAM for tracking and mapping:


Camera Frame

The Frame data structure is used to contain each captured image from the camera, along with

associated metadata. It will initially be used during tracking, to estimate its pose in relation to

a Keyframe, and then will also be used for mapping to update a Keyframe’s depth observations,

using the pose estimate produced from the tracking task. Each Frame contains the following:

– Dimensions: Width and height in terms of number of pixels.

– Timestamp: The time of capture from the camera.

– Frame Id: Integer to efficiently identify the ordering of captured frames.

– Intrinsic camera Matrix K: One static copy associated with all frames, it describes

all the intrinsic imaging and optical parameters of the camera and lens combination.

– Image: Pixel array, the actual image captured from the camera.

Keyframe

A Keyframe represents a portion of the generated map. It is essentially a camera frame and

a collection of depth estimates for a subset of its pixels. Each depth estimate is in terms

of distance from the plane of projection of the camera, and is one-dimensional. However, a Keyframe also contains a camera-to-world pose for its frame, allowing the depth of a Keypoint to fully represent the 3D position of observed parts of the world. Each Keyframe contains the

following:

– Pose: Camera-to-world pose estimate for the Keyframe’s camera frame.

– Frame: A camera frame that was selected as a Keyframe candidate and its associated

metadata.

– Keypoint Array: An array with the same dimensions as the Frame (width x height). Each Keypoint is a potential depth observation, and is composed of the following variables:


– Valid bit (Is there a valid observation for this pixel)

– Blacklisted (Has it been blacklisted after multiple failures to match)

– Validity Score (Used to keep track of successful or failed attempts to match)

– Inverse Depth (Depth estimate after a map update)

– Inverse Depth Variance (Estimated variance for above depth)

– Smoothed Inverse Depth (Smoothed version of inverse depth)

– Smoothed Inverse Depth Variance (Smoothed version of variance estimate)

– Max gradient array: Same dimensions as Frame. The maximum gradient in a region

around every pixel. A precise definition of its calculation is included in Chapter 3.
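The Frame and Keyframe layout above can be sketched as plain structures. This is an illustrative sketch, not the LSD-SLAM source code; all field and type names are assumptions chosen to mirror the description.

```python
from dataclasses import dataclass, field

@dataclass
class Keypoint:
    # Field names are illustrative, mirroring the description above.
    valid: bool = False               # is there a valid observation for this pixel
    blacklisted: bool = False         # blacklisted after repeated failures to match
    validity_score: int = 0           # tracks successful/failed match attempts
    idepth: float = 0.0               # inverse depth estimate after a map update
    idepth_var: float = 1e10          # estimated variance of the inverse depth
    idepth_smoothed: float = 0.0      # smoothed inverse depth
    idepth_var_smoothed: float = 1e10 # smoothed variance estimate

@dataclass
class Frame:
    width: int
    height: int
    timestamp: float
    frame_id: int
    image: list                       # row-major pixel intensities, len == width*height

@dataclass
class Keyframe:
    frame: Frame
    pose: list                        # camera-to-world pose estimate for the Frame
    keypoints: list = field(default_factory=list)      # one Keypoint per pixel
    max_gradients: list = field(default_factory=list)  # one value per pixel

    def __post_init__(self):
        # Allocate one Keypoint and one max-gradient entry per pixel.
        n = self.frame.width * self.frame.height
        if not self.keypoints:
            self.keypoints = [Keypoint() for _ in range(n)]
        if not self.max_gradients:
            self.max_gradients = [0.0] * n
```

The intrinsic matrix K is kept as a single static copy shared by all Frames, so it is deliberately not a per-Frame field here.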

Algorithm

The main tasks and the control flow between tracking and mapping are presented in Fig. 2.1.

There are other secondary and background functions involved in off-the-shelf SLAM, but they are either not relevant to this thesis or do not significantly affect performance, so they will not be discussed in this section.

The main input is a series of images from a camera that are immediately converted to a Frame

data structure as they move through the algorithm. The other main data structure is the

Keyframe, storing the current estimate of the local map. These two main data structures are

shared and need to be synchronised continuously between the two tasks. In the figure they are

indicated with orange coloured boxes, while the light blue boxes represent the main functions

necessary to perform tracking and mapping.

The Track Frame function uses the tracking reference, composed of a Keyframe and its max

gradients array, along with the previously tracked Frame’s pose estimate, to estimate the pose

of the new camera Frame. The tracked Frames are then placed in a queue, depicted with orange

colour in Fig. 2.1, to be used to refine the Keyframe’s depth estimates for each Keypoint and

generate new observations, stored as new Keypoints. In case of a large performance difference


Figure 2.1: SLAM algorithmic overview. The main tasks are inside the two dashed rectangles. The main data structures are indicated with orange coloured boxes, while the light blue boxes represent the main tasks necessary to perform tracking and mapping. With grey we indicate some of the background tasks involved in full SLAM.

between mapping and tracking, the oldest tracked Frames can be dropped, but this can result in a reduction in the quality of the map and the robustness of the algorithm. Once the “closeness score” between the Keyframe and the currently tracked Frame goes over a threshold, the depth

estimates are transferred to a new Keyframe initialized from a recently tracked Frame. A key

characteristic of the control flow of the algorithm is the interdependency between the two tasks

of tracking and mapping and the requirement for a high communication bandwidth between

them.

Computational cost

The dependent variables in the algorithm are mainly the scene complexity (how many high-

texture areas are visible in the camera’s image) and the quality of the camera and lens apparatus

which can influence the sharpness, contrast and optical and electronic noise present in the digital

image. Noise and lack of focus is undesirable and will always reduce the quality and robustness


of the algorithm.

Higher scene complexity means:

• More points to map and track (for the successfully mapped ones), a denser map

• More useful information, and the potential for better accuracy

• Higher computational cost, since there is more data to process for each tracking and

mapping update.

In Section 1.2.4 we discussed independent variables in SLAM from a theoretical standpoint.

In practice, in LSD-SLAM the following are the main tunable independent variables that affect

quality and complexity.

Frame Resolution - Tracking and Mapping:

This can be the camera’s resolution, or lower by subsampling the original image. The same applies

to the Frame used in mapping. Most often in LSD-SLAM tracking uses a subsampled version

of the Frame used in mapping, where the dimensions are halved for performance reasons.

Pyramid levels during tracking:

Starting from the captured Frame, the image is subsampled by a factor of 4 (each dimension

divided by 2) a number of times, generating a sequence of image levels with a progressively lower

resolution. Tracking is then performed first at a very coarse resolution, and after convergence

repeats at the next finer level, using the pose estimate of the previous level as a starting point.

This process significantly improves the convergence radius and robustness of the algorithm with

a relatively small increase in computation time.

Maximum iterations threshold for Tracking:

Tracking is iterative, and lasts until the error or convergence rate is very low, or a maximum number of iterations is reached (to improve the worst-case computation latency). These

parameters are tunable and can affect worst-case runtime and accuracy.

All of the above will be discussed qualitatively and in-depth in Sections 2.5 and 2.6 as well as

Chapters 4 and 5. However, at a high level, their effect can be modelled as follows.


Tracking operates on a set of k mapped points stored as Keypoints in the Keyframe, where

0 ≤ k ≤ N and N = width × height is the resolution of the Frame used for mapping. If a

different resolution is used for tracking and mapping, a subsampled version of the Keyframe

needs to be created as well, where each subsampled Keypoint is assigned a weighted average of

the properties of the 4 Keypoints it replaces. Further subsampled versions of the new Frame and

the Keyframe are then created for each pyramid level used in tracking, each with a resolution

1/4 the resolution of the level above it.

For infinite pyramid levels, the number of pixels included in all pyramid levels subsampled from a starting resolution N would be (1/4 + 1/4² + 1/4³ + ...)N. As that is a geometric series, we know that:

$$\sum_{n=1}^{\infty} \frac{1}{4^n} = \frac{1}{3}$$

Therefore the number of pixels to process is always less than or equal to:

$$N + \frac{1}{3}N = \frac{4}{3}N$$

for tracking and N for mapping. Though one iteration of tracking is less computationally

intensive than a complete update of the Keyframe, it is repeated multiple times for each level.

However, the upper bound of that is known, and tunable. The maximum iterations per level

are progressively fewer for lower subsampling levels, from multiple tens of attempts for coarser

levels to less than 10 for the finer and most computationally expensive one. Therefore, the

worst case latency for tracking is asymptotically bounded by a constant multiple of (4/3)N.
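The bound above can be checked numerically with a short sketch (the function name is ours, not the thesis’s):

```python
def pyramid_pixels(width, height, levels):
    """Total pixel count across `levels` pyramid levels,
    halving each dimension (quartering the pixels) per level."""
    total = 0
    w, h = width, height
    for _ in range(levels):
        total += w * h
        w, h = w // 2, h // 2
    return total
```

For a 640×480 frame, five levels give 409,200 pixels in total, just under the bound of (4/3) × 307,200 = 409,600.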

Gradient threshold:

Pixel selection depends on the maximum gradient in an immediate neighbourhood, which is

affected from scene complexity, but the gradient threshold used to perform the selection is a

tunable independent parameter.

By tweaking the gradient threshold, and depending on the scene complexity, we can tweak

the subset of pixels that we are processing for tracking and mapping. For typical scenes and

thresholds the number of valid Keypoints is k ≈ 0.20N. Mapping has to process all N pixels

to generate the k valid observations. As we shall discuss in the following sections, the cost per

Keypoint varies, and a non-valid Keypoint will require even less processing, but the maximum

amount of processing steps is bounded.
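A minimal sketch of gradient-based pixel selection, assuming central differences and a per-pixel maximum over the two axes; LSD-SLAM’s actual region-maximum gradient test and its refinement after tracking differ in the details.

```python
def select_keypoints(image, width, height, threshold):
    """Return indices of interior pixels whose largest central-difference
    gradient exceeds `threshold` (a simplified stand-in for the
    region-max-gradient selection described in the text)."""
    selected = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            i = y * width + x
            gx = abs(image[i + 1] - image[i - 1]) / 2.0
            gy = abs(image[i + width] - image[i - width]) / 2.0
            if max(gx, gy) > threshold:
                selected.append(i)
    return selected
```

Raising the threshold shrinks the selected subset k, directly trading map density and accuracy against computation, as described above.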

Therefore, although the precise execution time is tunable through the parameters above, both

tracking and mapping have a complexity of O(n), where n is the number of pixels present in the camera’s

Frame. There are other computations associated with executing and linking these two tasks,

but they are all either constant time with a complexity of O(1) or are equivalent to N multiplied

by a constant and therefore Θ(n). As such, the broader computational cost of LSD-SLAM is

in the order of O(n). A more precise theoretical discussion of the different steps involved in

the algorithm will be introduced in the next two sections. Then the tracking and mapping

functions will be described in more precise detail in Chapters 3 to 5, along with their hardware

implementation.

2.5 Tracking in LSD-SLAM

The next two sections will introduce the key ideas behind direct semi-dense tracking and semi-

dense depth map estimation, based on the works of J. Engel et al: “Semi-dense visual odometry

for a monocular camera” [1] and “LSD-SLAM” [17]. We will discuss the basics of how they

were implemented in software and the characteristics that turned out to be important during

the research work presented in this thesis.

The basis of LSD-SLAM is semi-dense photometric tracking, closely coupled with a semi-dense

filtering depth estimation algorithm. The aim of tracking is to recover the camera pose with

respect to the world for each camera frame. LSD-SLAM projects the observed 3D points of the

world from the generated map to the current camera frame to optimise directly on the pixel

intensity differences in a process called direct whole-image alignment. This is expressed as a

variance-normalised weighted least squares minimization of the photometric error. For tracking,

only the information-rich points in the camera’s view are used based on the photometric gradient

value around a point. After the optimisation process, a pose estimate is generated for the


camera, describing its position and orientation in 3D space. The algorithm then uses that

pose estimate to recover the depth of observed points, according to the principles of multi-

view geometry [47], while existing depth observations are updated as a Gaussian probability

distribution with their associated mean and variance.

A key idea is a depth map, propagated from frame to frame and continuously used for tracking

while being updated with every newly tracked frame using two-view stereo. The depth map is

initialised from a camera frame, matching and storing a maximum of one depth estimate per

camera pixel location. Each of these estimates is stored in a data structure, along with the

frame it was initialised on, as a Gaussian probability distribution with its variance estimate.

From this point forward, the data structure containing all of the above information with the

associated camera frame will be called a Keyframe, and the valid depth estimates inside it are

also referred to as Keypoints. After a sufficient camera displacement, when a distance/disparity

heuristic between the current frame and the Keyframe is satisfied, a new Keyframe is generated

from the new point of view. Then, the previous one is retired and is incorporated in a graph

of Keyframes, which for LSD-SLAM represents the global map. Local and global mapping will

be described more in the next section.

In Fig. 2.2 we can see a visual representation of the depth and variance values collected in

the Keyframe and the calculated error (residual) from projecting these 3D points from the

Keyframe’s point of view to that of the camera. In Fig. 2.2(a) we can see the two camera images

used during tracking and mapping. In Fig. 2.2(b) and (c) we can see the estimated inverse depth

map and its associated variance for these two views, where the colour transitioning from red to

blue indicates the magnitude of the recovered depth, and variance, with red being closer, or a

smaller variance. In Fig. 2.2(d) and (e) we see a visualisation of the residuals on the left and

their variance on the right when tracking (d) and mapping (e) with lighter colour indicating

a larger value. Finally, Fig. 2.2(f) visualises the normalising weights for the two views, used

during tracking optimisation.

Notation: In this section, matrices are denoted as bold, capital letters (R) and vectors as bold

lower case letters (t).


Figure 2.2: Figure adapted from [1] to demonstrate tracking with direct Keyframe alignment on sim(3) utilizing the estimated depth map. We can see from two camera views the current state of the map on the left as a collection of inverse depth and depth variance values for the mapped points and the photometric residual on the top right as a result of the reprojection.

The camera poses are initially represented as 3D Rigid Body Transformations as in Eq. 2.1, while the pose-to-pose constraints between Keyframes are represented with 3D Similarity Transformations, which have an additional scale factor $s \in \mathbb{R}$ as in Eq. 2.2:

$$\mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in SE(3) \qquad (2.1)$$

$$\mathbf{T} = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in \mathrm{Sim}(3) \qquad (2.2)$$

LSD-SLAM uses pixels characterised by a gradient bigger than a set threshold to perform

tracking and mapping. It operates under the assumption that the pixel areas that are the most


useful are those that contain a high enough intensity gradient, and therefore some kind of texture

or edge. During tracking, LSD-SLAM does not require the intermediate and computationally

demanding step of detecting and comparing features to find unique matches, and at the same

time manages to create a more dense reconstruction of the environment, offering a good middle

ground between sparse and dense algorithms.

During optimisation, LSD-SLAM uses the vector ξ ∈ se(3) of the associated Lie algebra as a minimal way to represent the camera pose, instead of using the SE(3) notation, as it allows optimisation on the pose directly. Lie algebra offers a way to apply mathematical optimisation

methods to the poses obtained, by representing them as elements of a Lie group, which is a

locally Euclidean differentiable manifold. After the first estimation, each pose will be processed

and updated by an iterative optimisation process, the Levenberg-Marquardt algorithm [48].

They can be converted back to $SE(3)$ with an exponential map $G = \exp_{\mathfrak{se}(3)}(\boldsymbol{\xi})$.

The pose is recovered by starting from a pose estimate T and aligning the frame that is tracked

with the current Keyframe, to minimise a photometric error. The inputs of the tracking task

are the initial pose estimate, the camera frame to be tracked and the semi-dense depth map

stored in the Keyframe. The optimisation happens over the variance-normalized sum of the

intensity disparity, as in the following equation:

$$E_p(\boldsymbol{\xi}_{ji}) = \sum_{\mathbf{p} \in \Omega_{D_i}} \left\lVert \frac{r_p^2(\mathbf{p}, \boldsymbol{\xi}_{ji})}{\sigma_{r_p}^2(\mathbf{p}, \boldsymbol{\xi}_{ji})} \right\rVert_{\delta} \qquad (2.3)$$

in which $r_p$ is the residual, calculated for the subset of pixels $\mathbf{p}$ in the Keyframe which contain a valid depth value $D_i$. Its value is the photometric disparity of projecting a Keypoint $\mathbf{p}$ to the frame captured at the camera’s current viewpoint based on the current pose estimate, as in Eq. 2.4. This intensity is further refined with an intensity estimate $I_j$ through an interpolation of the values of the 4 pixels surrounding the projected point on the camera frame. In Eq. 2.4 this sub-pixel interpolation is symbolised with $\omega$.

$$r_p(\mathbf{p}, \boldsymbol{\xi}_{ji}) := I_i(\mathbf{p}) - I_j(\omega(\mathbf{p}, D_i(\mathbf{p}), \boldsymbol{\xi}_{ji})) \qquad (2.4)$$


In Eq. 2.3, $\sigma_{r_p}^2$ is the residual’s variance, computed using covariance propagation and utilizing the inverse depth variance $V_i$ stored in the Keyframe for each point under a Gaussian noise assumption:

$$\sigma_{r_p}^2(\mathbf{p}, \boldsymbol{\xi}_{ji}) = 2\sigma_I^2 + \left( \frac{\partial r_p(\mathbf{p}, \boldsymbol{\xi}_{ji})}{\partial D_i(\mathbf{p})} \right)^2 V_i(\mathbf{p}) \qquad (2.5)$$

The operator $\lVert \cdot \rVert_{\delta}$ is the Huber norm, calculated as in Eq. 2.6, a normalizing factor that reduces the effect of larger residuals on the optimization process, improving the accuracy and robustness of the algorithm in the presence of outliers.

$$\left\lVert r^2 \right\rVert_{\delta} := \begin{cases} \dfrac{r^2}{2\delta} & \text{if } |r| \le \delta \\[4pt] |r| - \dfrac{\delta}{2} & \text{otherwise} \end{cases} \qquad (2.6)$$
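Eq. 2.6 maps directly to code; the following is a sketch of the norm itself, not the LSD-SLAM implementation.

```python
import math

def huber_norm(r_squared, delta):
    """Huber norm of a squared residual (Eq. 2.6): quadratic for small
    residuals, linear for large ones, continuous at |r| == delta."""
    r = math.sqrt(r_squared)
    if r <= delta:
        return r_squared / (2.0 * delta)
    return r - delta / 2.0
```

At $|r| = \delta$ both branches evaluate to $\delta/2$, so the penalty transitions smoothly from quadratic to linear, which is what limits the influence of outliers on the optimisation.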

Finally, the sum is minimized using the Levenberg-Marquardt method which performs a damped

Gauss-Newton optimisation. Gauss-Newton is based on a local Taylor approximation of the

error function to improve how rapidly and accurately the optimisation converges to a minimum

compared to gradient descent. Levenberg-Marquardt converges faster by adding a positive

multiple of an identity matrix of the same size to the Hessian matrix: H(x) + λI essentially

interpolating depending on the value of lambda between a traditional Gauss-Newton and gra-

dient descent and is better able to deal with non-linear functions such as the photometric error

used here compared to traditional Gauss-Newton implementations [48].

In this implementation, the Hessian is substituted with the approximation $\mathbf{J}^T\mathbf{J}$, so the step is formulated as:

$$\delta\boldsymbol{\xi}^{(n)} = -\left(\mathbf{J}^T\mathbf{J} + \lambda \mathbf{I}\right)^{-1}\mathbf{J}^T r(\boldsymbol{\xi}^{(n)}) \qquad (2.7)$$

where $\boldsymbol{\xi} \in \mathfrak{se}(3)$ and $n$ is the optimisation step, with:

$$\mathbf{J} = \left. \frac{\partial r(\boldsymbol{\epsilon} \circ \boldsymbol{\xi}^{(n)})}{\partial \boldsymbol{\epsilon}} \right\rvert_{\boldsymbol{\epsilon} = 0} \qquad (2.8)$$
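The damped step of Eq. 2.7 can be sketched in a few lines of pure Python. This builds the normal equations $(\mathbf{J}^T\mathbf{J} + \lambda\mathbf{I})\,\delta = -\mathbf{J}^T r$ and solves them with Gaussian elimination; it is illustrative only, as real implementations rely on an optimised linear-algebra library.

```python
def lm_step(J, r, lam):
    """One damped Gauss-Newton step (Eq. 2.7).
    J is a list of m rows of n entries each, r a list of m residuals."""
    m, n = len(J), len(J[0])
    # Normal equations: A = J^T J + lambda*I, b = -J^T r
    A = [[sum(J[k][i] * J[k][j] for k in range(m)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [-sum(J[k][i] * r[k] for k in range(m)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= f * A[col][j]
            b[row] -= f * b[col]
    # Back substitution
    delta = [0.0] * n
    for i in range(n - 1, -1, -1):
        delta[i] = (b[i] - sum(A[i][j] * delta[j] for j in range(i + 1, n))) / A[i][i]
    return delta
```

With λ = 0 this reduces to a plain Gauss-Newton step; a large λ shrinks the step towards scaled gradient descent, as discussed above.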

Additionally, in LSD-SLAM a weighting scheme is implemented, where in each iteration a


weight matrix $\mathbf{W} = \mathbf{W}(\boldsymbol{\xi}^{(n)})$ is generated depending on the residual and the depth certainty (from the variance estimate). The residual in the iteratively solved error function is then multiplied by the weight factors, and the error function becomes:

$$E_p(\boldsymbol{\xi}_{ji}) = \sum_{\mathbf{p} \in \Omega_{D_i}} \left\lVert \frac{w(\mathbf{p}, \boldsymbol{\xi}_{ji})\, r_p^2(\mathbf{p}, \boldsymbol{\xi}_{ji})}{\sigma_{r_p}^2(\mathbf{p}, \boldsymbol{\xi}_{ji})} \right\rVert_{\delta} \qquad (2.9)$$

Thus the update is computed by solving:

$$\delta\boldsymbol{\xi}^{(n)} = -\left(\mathbf{J}^T\mathbf{W}\mathbf{J} + \lambda \mathbf{I}\right)^{-1}\mathbf{J}^T\mathbf{W} r(\boldsymbol{\xi}^{(n)}) \qquad (2.10)$$

The goal of the optimisation process is to estimate an update δξ(n) that converges towards

a minimum of the error function as quickly and accurately as possible. The first step is to

calculate the residual, weights, and perform one iteration of the optimisation process. After

this step, the linear system generated is solved for different values of λ, the residual and weights

are recalculated and a tentative update step is generated and applied to generate a new pose.

If the new pose decreases the error then the pose update is applied as follows:

$$\boldsymbol{\xi}^{(n+1)} = \log_{SE(3)}\!\left( \exp_{\mathfrak{se}(3)}(\delta\boldsymbol{\xi}^{(n)}) \cdot \exp_{\mathfrak{se}(3)}(\boldsymbol{\xi}^{(n)}) \right) \qquad (2.11)$$

and the Hessian and Jacobian are recalculated from a new position. Otherwise the update is

rejected, and the system is solved again for a new lambda. This process is stopped if a maximum

number of iterations is reached, if the error decreases by an amount smaller than a threshold

or if the update step is smaller than a set threshold, in an effort to improve convergence and

reduce the number of iterations for improved computational efficiency, especially in the light of

pyramid processing.
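The accept/reject control flow just described can be sketched generically. Here `error_fn` and `step_fn` are placeholders for the photometric error and the damped step of Eq. 2.7, and the λ update factors are illustrative assumptions, not the values used in LSD-SLAM.

```python
def lm_optimise(error_fn, step_fn, x0, lam0=1.0, max_iters=20, min_decrease=1e-6):
    """Levenberg-Marquardt accept/reject loop: try a tentative step; if the
    error drops, accept it and relax the damping, otherwise reject it and
    increase the damping. Stops on max iterations or a tiny error decrease."""
    x, lam = x0, lam0
    err = error_fn(x)
    for _ in range(max_iters):
        x_new = step_fn(x, lam)          # tentative update for the current lambda
        err_new = error_fn(x_new)
        if err_new < err:                # accept: the error decreased
            if err - err_new < min_decrease:
                x, err = x_new, err_new  # converged: decrease below threshold
                break
            x, err, lam = x_new, err_new, lam * 0.5
        else:                            # reject: solve again with larger lambda
            lam *= 4.0
    return x, err
```

For example, minimising the scalar error $(x-3)^2$ with the corresponding damped step $x \mapsto x - (x-3)/(1+\lambda)$ converges to $x \approx 3$ within a handful of iterations.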

The discussed optimisation process is applied for different resolution levels, in a pyramid

representation at different levels of subsampling from coarse to fine. This aids convergence, as

direct image alignment is inherently a non-convex optimisation problem. This coarse-to-fine

processing works by initially smoothing the search space and optimising there, so that at the


next, finer step, the search is initialised at an already good estimate. Starting sometimes at a

resolution as low as 15×20 or, more typically, 30×40, this was proven to be a very good solution to

increase the convergence radius and accuracy in a variety of scenarios [17].

Figure 2.3: Pyramid processing. Starting at a lower, coarser resolution increases the radius of convergence for the optimisation and improves the speed of reaching a minimum.

As tracking has to begin at a resolution many steps coarser than the current camera/map

resolution, for every frame tracked a generation of subsampled levels is required for the two

data structures: the current camera frame and the Keyframe. In the case of the camera frame,

this is a simple averaging over four pixels for successive division of the resolution by two in both

dimensions. To support the multi-level tracking process a subsampled version of the Keyframe

is also generated. Firstly, the pixel values of the Keyframe’s image are subsampled in the same

way as the current camera frame. Simultaneously, to assign depth and depth variance to these

coarser Keypoints, the algorithm combines the depth map observations (if valid) across the

four pixels at the next finer pyramid level, weighing the process with their variance values to

give more importance to the most confident observation. If the resulting confidence is below

a threshold (or indeed all four constituent Keypoints are invalid) the Keypoint at this level is

simply marked as invalid as well.
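The variance-weighted fusion of 2×2 Keypoint blocks can be sketched as follows. The inverse-variance weighting used here is a common choice and an assumption on our part; LSD-SLAM’s exact scheme differs in details such as the confidence threshold for invalidation.

```python
def downsample_keypoints(idepth, ivar, valid, width, height):
    """Fuse each 2x2 block of Keypoints into one coarser Keypoint, weighting
    every valid inverse depth by its inverse variance (more confident
    observations contribute more). Blocks with no valid member stay invalid."""
    w2, h2 = width // 2, height // 2
    out_d = [0.0] * (w2 * h2)
    out_v = [1e10] * (w2 * h2)
    out_valid = [False] * (w2 * h2)
    for y in range(h2):
        for x in range(w2):
            num, den = 0.0, 0.0
            for dy in (0, 1):
                for dx in (0, 1):
                    i = (2 * y + dy) * width + (2 * x + dx)
                    if valid[i]:
                        wgt = 1.0 / ivar[i]     # inverse-variance weight
                        num += wgt * idepth[i]
                        den += wgt
            j = y * w2 + x
            if den > 0.0:
                out_d[j] = num / den            # weighted mean inverse depth
                out_v[j] = 1.0 / den            # fused variance
                out_valid[j] = True
    return out_d, out_v, out_valid
```

Applied recursively, this produces the coarser Keyframe levels needed for the coarse-to-fine tracking described above.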


2.6 Mapping in LSD-SLAM

After the tracking task provides a pose estimate, the first step for mapping is to decide which

points to map. In the literature of SLAM different methods have been used to judge point

quality for sparse or semi-dense SLAM, usually centring on the idea that changes in pixel

intensity indicate a possible feature location [29, 33]. In the semi-dense methods we discuss

here, the points to attempt to map are selected based on the maximum gradient of a pixel

and of its immediate neighbourhood being above a threshold. Pixels with a higher local gradient are considered good candidates to match, and this is refined after tracking with a “pixel usefulness” metric, updated when a valid map point is used successfully or unsuccessfully for

tracking. This is a natural choice since semi-dense tracking in LSD-SLAM attempts to directly

use image intensity gradients to optimise camera pose estimation. It should be noted at this

point that in these works there is also an underlying assumption of smoothness for the world

observed.

For each depth update, the aim of mapping is to perform an exhaustive search for a new

observation of each good quality point in the Keyframe along a line on the current frame.

This line is the epipolar line, and the search is conducted using the pixel’s intensity value.

Geometrically, if the relative position and orientation of the camera is known when two frames

were captured, it is proven that a point observed on one camera frame will always project to a

line on the plane of the other camera’s frame [47] called the epipolar line. This is demonstrated

in Fig. 2.4. Two camera frames will not always observe the same point in frame, as the

epipolar line may lie completely outside the frame that a sensor will capture. As such, the search is restricted to the intersection of the line and the image frame. If a successful match is

found, this can then be used to calculate a new estimate for the depth value of that point.

Another idea utilized in semi-dense mapping in LSD-SLAM is that if there is a prior depth

observation with sufficient confidence in its observation (judged by the estimated variance in

the Gaussian hypothesis), the estimated variance is used to limit the search interval to d ± 2σd,

where d and σd denote the mean and standard deviation of the prior hypothesis. At the end

of the search for a good match, a sub-pixel localisation is performed at the match location in


Figure 2.4: Epipolar geometry, the epipolar line is depicted with orange colour

an attempt to reduce the error further, by interpolating between two neighbouring matches.

Finally, instead of scanning to match a single pixel, a squared error function comparing 5

equidistant points is used to improve accuracy. This approach significantly increases robustness

with only a small increase in complexity, since it still places and searches for these points along

the same line, meaning successive values can be reused.
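A sketch of the two ideas above — the d ± 2σ prior interval and the 5-point squared-error match — assuming the epipolar line has already been sampled into a 1D array of intensities. Function names are ours, not LSD-SLAM’s.

```python
def search_interval(d, var_d, n_sigma=2.0):
    """Limit the search to the prior hypothesis d +/- 2*sigma (Gaussian prior)."""
    sigma = var_d ** 0.5
    return d - n_sigma * sigma, d + n_sigma * sigma

def epipolar_match(ref_vals, line_vals):
    """Exhaustive search along the (pre-sampled) epipolar line using a 5-point
    squared-error function; successive positions reuse 4 of the 5 samples."""
    best_off, best_err = None, float("inf")
    for off in range(len(line_vals) - 4):
        err = sum((ref_vals[k] - line_vals[off + k]) ** 2 for k in range(5))
        if err < best_err:
            best_off, best_err = off, err
    return best_off, best_err
```

The best offset would then be refined with the sub-pixel interpolation step described in the text before triangulating a depth.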

After a successful disparity search along the epipolar line, and an optional step of subpixel

refinement, the depth of the point is calculated. In LSD-SLAM an inverse depth formulation is

used and stored for each point, and the authors have included both photometric and geometric

errors in the formulation of the uncertainty σ2d. If this is the first successful observation of

a point, it is initialised with this observation as its depth estimate. Subsequent observations

are incorporated into previous ones by multiplying the two distributions (corresponding to the

update step in a Kalman filter) [1].
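Multiplying the two Gaussian distributions reduces to the standard scalar Kalman update; a minimal sketch:

```python
def fuse_hypotheses(d1, var1, d2, var2):
    """Fuse two Gaussian inverse-depth hypotheses N(d1, var1) and N(d2, var2)
    by multiplying the densities (the update step of a scalar Kalman filter).
    The fused variance is always smaller than either input variance."""
    var = (var1 * var2) / (var1 + var2)
    d = (d1 * var2 + d2 * var1) / (var1 + var2)
    return d, var
```

Each additional observation therefore tightens the hypothesis: fusing two equal observations halves the variance.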

During the lifetime of the SLAM algorithm, after a sufficient distance or viewpoint change

from the pose of the current Keyframe, a new Keyframe is instantiated from the latest camera

frame, to which the current depth map is propagated. Using the two frames’ relative 6-DoF


Figure 2.5: From [1], the top row is different camera frames overlaid with the estimated semi-dense inverse depth map. The bottom row contains the camera view as a pyramid with blue edges, with its associated trajectory as a line in front of a 3D view of the tracked scene.

position, the inverse depth for each point is calculated for the new position with a small increase

in uncertainty. Keeping the uncertainty increment small was found to be more effective at

reducing drift [1]. The point is then stored as a depth estimate to its closest corresponding

integer pixel location in the new frame. In the case of two points mapping to the same frame pixel,

if they are statistically similar they are fused as two observations. Otherwise, the point with

the largest depth is treated as occluded and deleted.

The final steps of depth mapping have to do with regularizing the depth map after every update

step. There are two algorithms at work here. The first one has to do with adding observations in

“holes” surrounded by successful neighbours and with removing outliers if a pixel or most of its

surrounding neighbours have become invalid. The notion of valid and invalid here corresponds

to a validity score stored in the Keyframe for each point, decreased or increased according to

successful or unsuccessful observations during tracking and mapping. It can also be invalidated

directly in special cases, such as a gradient falling below a certain threshold in a new Keyframe

due to changes in angle and distance.

The second filtering algorithm calculates a smoothed version of the depth, stored separately.

The smoothed version is used to initialise the search parameters for the epipolar line to the

most probable centre and range, according to a smoothness prior for the world. It can also be

used for visualisation, robotics and augmented reality applications. The exact value is used in

the rest of the algorithm and in tracking, and is where the current Gaussian hypothesis remains stored.

Lastly, LSD-SLAM uses older Keyframes that have been retired to represent the global map.

As has been mentioned previously, after a sufficient distance/disparity heuristic threshold, a

Keyframe is replaced by one closer to the current view of the camera, and the depth values it

contained are propagated to the new one. Once a Keyframe has been retired it is incorporated

in a global pose-to-pose graph. In that graph, each Keyframe is represented as a vertex, with

3D similarity transforms as edges, as in Eq. 2.2, adding scale information as a 7th degree of

freedom. This scale factor is included in the vertex to keep track of scale changes since for

every propagation of depth between Keyframes the map is scaled to keep the average depth

value close to 1.0.

Since in monocular methods absolute depth values are unrecoverable due to the scale ambiguity

problem, the depth map is always relative. Taking advantage of this, scaling the map to a value

of 1 when creating a new Keyframe improves the behaviour of the optimisation processes of

tracking and mapping, while simultaneously keeping track of scale changes between Keyframes.
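A sketch of this rescaling step, under the assumption that normalisation is applied to the mean of the valid inverse depths (the text speaks of keeping the average depth close to 1.0; the exact quantity normalised is an implementation detail):

```python
def rescale_map(idepths, valid):
    """Scale a propagated depth map so the mean of its valid (inverse) depths
    becomes 1.0, returning the scale factor recorded on the Keyframe-graph edge."""
    vals = [d for d, ok in zip(idepths, valid) if ok]
    scale = sum(vals) / len(vals)
    scaled = [d / scale if ok else d for d, ok in zip(idepths, valid)]
    return scaled, scale
```

The returned scale factor is what the Sim(3) edges of the Keyframe graph carry as the 7th degree of freedom.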

The algorithm then uses a graph optimisation method to reduce tracking drift in the background

during loop closures, using only pose-to-pose constraints to increase computational efficiency. The graph optimisation framework used in LSD-SLAM is g2o, published by Grisetti et al. [49] and available as an open-source library. This, combined with loop closure and the ability to revisit and re-use old Keyframes, improves LSD-SLAM’s large-scale capabilities and accuracy.

2.7 Proposed Architectures and FPGA-SoCs

The architectures proposed in this thesis were designed with a heterogeneous system-on-chip in

mind. The central idea is that different processors and custom hardware can provide complementary capabilities. Thus, a heterogeneous system-on-chip can be greater than the sum of its

parts. A mobile CPU can provide the flexibility and compatibility with off-the-shelf software

implementations and libraries to enhance the support for state-of-the-art SLAM. However, such

SLAM algorithms need a higher level of performance than any mobile CPU can provide. They


are data-intensive and characterised by a high number of operations per data point, and by variable, complex control flow.

In contrast, specialised accelerator units, designed specifically to take advantage of the char-

acteristics of the algorithm, stand to offer significantly more compute performance in a more

power-efficient platform for the performance critical parts of the application. Implemented in

the same chip as the CPU, due to physical proximity, the latency and energy cost of commu-

nication between the accelerators and the CPU is minimised, which enables cooperation in a

finer granularity and a more efficient manner. Simultaneously, when operating in the same

memory space this cooperation will be more efficient with having access to the same off-chip

resources, such as an off-chip DRAM. For example, a single well-designed DRAM controller

will improve the efficiency of memory traffic from all sources. Moreover, the proximity of the

CPU’s caches to the custom memories in the accelerator offer the opportunity of coherency

protocols to be implemented to achieve low-latency synchronisation of shared data. A concept

of this architecture is depicted in Fig. 2.6.

Figure 2.6: Concept heterogeneous architecture block diagram. The diagram shows a mobile CPU and its cache subsystem alongside mapping and tracking accelerators with shared image caches and a camera, all connected to off-chip DRAM through a common memory controller.


Field Programmable Gate Arrays

However, as mentioned in the introduction, the cost of implementing these designs directly

as ASICs is very high. Furthermore, the application domain is still under active investigation and the state of the art changes often, meaning the long development cycles characteristic of custom silicon cannot provide an immediate solution. Because of this, the designs presented

in this thesis were developed with an FPGA-SoC in mind that can offer most of the benefits

of custom hardware while reducing development and deployment costs and time significantly.

As a result, an off-the-shelf FPGA-SoC can be used directly by researchers in the field and

adapted for different needs, or included in industry projects. Once the architectures and the

algorithms both mature, most of our architectural choices carry over to the realm of fully-custom ASICs and can be adapted to that domain to realise even higher performance and power efficiency.

Moreover, modern FPGAs now come with many hardened units for frequently used or expensive

operations. In Fig. 2.7 we can see the building blocks that make up a modern FPGA fabric.

Look-up tables that emulate digital gates are organised in logic blocks with extra capabilities, while dedicated SRAM memories form “columns” of Block RAM (BRAM) between them for low-latency, multi-port access. Dedicated DSP blocks, containing a multiplier, double-precision

accumulator, pre-adder and other hard-coded operations (such as XOR and pattern detection)

implemented in silicon, enhance the performance and reduce the resources needed for many fixed- and floating-point math and binary operations. All of this is surrounded by fast reconfigurable

interconnect, which improves routing significantly over earlier FPGAs. The combination of these advances comes with a lower latency and resource cost than the logic-only FPGA fabrics of the past, significantly narrowing the gap in performance and area discussed in the introduction.

These resources are organised in slices (represented as the smallest parallelograms in the figure) and then in clock domains, surrounded by switch interconnect. Finally, the border usually houses dedicated I/O circuits and ports that connect to other resources on and off chip. This is summarised in Fig. 2.8, presenting an example block representation of a modern FPGA fabric.

As such, the performance gap to ASICs narrows to an order of magnitude or less for some designs,


Figure 2.7: FPGA architecture. Dedicated SRAM memories and DSP blocks with capable hardened multipliers have significantly improved the efficiency of frequently used and traditionally costly operations.

Figure 2.8: A modern FPGA fabric is organised in clock domains made up of slices, with the edges of the silicon usually housing communication circuits and ports.


allowing the use of architectures implemented in FPGA-SoCs as permanent accelerators in

production-level devices. In this thesis, our work will be presented first as an architecture,

with some choices based on the FPGA-SoCs used for implementation but otherwise at a level

that can be generalised to other platforms. Then the evaluation of the architecture will be

presented, where the design will be discussed as implemented on an off-the-shelf FPGA-SoC

device.

The FPGA-SoCs targeted in this thesis are of the Zynq family from Xilinx. The CPU and

FPGA can function independently, and coexist on the same chip sharing a capable and efficient

interconnect, allowing our designs to be realised at a good performance level. In Fig. 2.9 we

have a simplified view of the parts of the SoC and interconnect that are relevant to this thesis.

Connections to peripherals and details not relevant to our discussion are omitted for clarity. One

can refer to Chapter 5 of the Xilinx Zynq-7000 TRM [2] for more details on the interconnect of the SoC and the rest of the system-on-chip.

Figure 2.9: Zynq 7-series FPGA-SoC and Interconnect. Only connections relevant to the architectures researched in this thesis are included for clarity of presentation. Source: Zynq-7000 TRM [2]. The diagram shows the dual-core ARM Cortex-A9 with its L1 I/D caches and 512 kB L2 cache and controller, the DRAM controller, the 64-bit memory interconnect, the general-purpose slave ports, and the four buffered high-performance AXI ports (HP[3:0]) connecting to the reconfigurable logic.


The FPGA has dedicated high-bandwidth buffered ports leading to an interconnect that allows simple adjustments, such as combining multiple sequential requests, for increased efficiency. This then links to a high-efficiency DRAM controller with content-addressable memories (CAMs),

that can further reorder the requests at the DRAM level to maximise the achievable efficiency.

In addition to these connections, there are dedicated slave-master and master-slave general-purpose ports to the CPU, which can be used for control and more fine-grained communication, as well as other ports to peripheral devices. For our architectures, the focus is on the general-purpose AXI

controllers, the four high-performance ports (exposed to the FPGA as AXI HP[3:0]) functioning

at a maximum of 150 MHz, as well as the ARM Cortex-A9 CPU and its cache subsystem,

communicating at more than double the frequency of the FPGA side but through a single port.
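As a back-of-the-envelope check of the figures above, the peak theoretical bandwidth of these ports can be computed directly. Note that this ignores AXI protocol overhead and DRAM efficiency, so the sustained bandwidth achievable in practice is lower.

```cpp
#include <cassert>

// Peak theoretical bandwidth of the high-performance AXI ports described
// above: four 64-bit ports, each clocked at up to 150 MHz on the FPGA side.
// Protocol overhead and DRAM efficiency reduce the sustained figure.
constexpr double kPortBits = 64.0;     // width of one HP port
constexpr double kClockHz  = 150.0e6;  // maximum HP port clock
constexpr int    kNumPorts = 4;        // AXI HP[3:0]

constexpr double per_port_GBs = kPortBits / 8.0 * kClockHz / 1.0e9; // 1.2 GB/s
constexpr double total_GBs    = per_port_GBs * kNumPorts;           // 4.8 GB/s
```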

High-level Synthesis

Finally, the architectures presented in the next chapters were designed and implemented using

Xilinx’s Vivado High-level synthesis (HLS) tools. These allow the hardware to be expressed

as a combination of C/C++ routines and pre-processor pragmas which guide the translation

from C/C++ to VHDL/Verilog. Using these pragmas one can guide pipelining, interfaces,

hardware unit reuse, blocking and non-blocking evaluation, resource allocation and the static

scheduling of operations to hardware units, much as one would at the HDL level. The designer then gets a quick estimate of the performance and can inspect the resources instantiated and the scheduling of operations on hardware units. The hardware design can first be debugged as a C-simulation, to correct logic errors faster with the help of software debugging tools, before testing on the slower but more accurate RTL-level simulation.
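As a concrete illustration of this flow, the sketch below shows the shape such HLS source takes: a hypothetical 3-tap moving-average filter over one image row, with real Vivado HLS pragma names (PIPELINE, ARRAY_PARTITION) guiding pipelining and memory partitioning. The function itself is illustrative and not taken from the thesis; a standard C++ compiler simply ignores the pragmas, which is exactly what makes C-simulation of the same source possible.

```cpp
#include <cstdint>

// Hypothetical HLS kernel: 3-tap moving-average filter over one 640-pixel
// image row. The pragmas request a pipelined loop with an initiation
// interval of one (one pixel per cycle) and full partitioning of the small
// shift register into registers, so all three taps can be read in parallel.
void blur3(const uint8_t in[640], uint8_t out[640]) {
    uint8_t win[3] = {0, 0, 0};
#pragma HLS ARRAY_PARTITION variable=win complete
    for (int x = 0; x < 640; ++x) {
#pragma HLS PIPELINE II=1
        win[0] = win[1];  // shift the window by one pixel
        win[1] = win[2];
        win[2] = in[x];
        out[x] = static_cast<uint8_t>((win[0] + win[1] + win[2]) / 3);
    }
}
```

Without the pragmas the tool would schedule the loop conservatively; with them, the loop is statically scheduled to accept one pixel per cycle, which is the pattern the streaming architectures in the following chapters rely on.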

The use of these tools greatly accelerates the development and debugging of custom hardware,

while allowing faster and easier design-space exploration. The key in this process is that these

tools are still hardware development tools. They cannot translate arbitrary software to hardware

but rather allow the expression of certain operations in terms of a higher-level language, a

subset of which is synthesizable. If correctly used by a designer with a deep understanding

of the process and the underlying hardware, the high-level synthesis tools do not adversely


affect an architecture’s throughput per cycle and largely achieve comparable latency results for

most pipelined architectures. Instead, they bring some predictable penalties to the maximum

achievable frequency and certain resource-usage overheads. At the same time, they allow a researcher to investigate more ambitious and complex designs in a smaller time-frame, potentially

leading to riskier but more promising architectures being explored. These overheads are hard

to quantify, as there is, at the time of writing, limited research comparing large, complicated designs realised in HLS and in traditional HDLs with similar optimisation effort for both.

However, results from the work presented in this thesis put the achievable frequency in the region of 100-125 MHz, which is very close to other image-processing designs in the literature, such as Greisen et al. [50] in 2011. Moreover, some recent works such as [51] seem to indicate that, while hand-written HDL can achieve very low latency for simple tasks, the ease of development of HLS tools can actually result in higher performance for more complicated applications, by allowing more time to be allocated to exploring and optimising the resulting hardware instead of simply implementing and debugging. In the evaluation sections of Chapters 4 and 5 we will discuss certain design choices and overheads, identified in our work, that were associated with the choice of HLS for development. Nevertheless, as we will discuss

in the following chapters, very high performance levels can still be achieved, which means

the implementations produced by HLS can be directly used without modification for many

applications where development time is of higher importance than the maximum achievable

performance and resources for a certain FPGA.

2.8 Related Work

Since platforms in the embedded space have significant constraints in power and performance,

current embedded SLAM implementations focus on sparse SLAM adapted towards further reducing computational requirements, such as [52, 53], and on feature-based implementations of sparse visual or visual-inertial odometry [54–56] that provide limited or no large-scale mapping and reconstruction capabilities. A state-of-the-art work in this category is SVO [57]. It


combines a semi-direct technique for tracking that is both robust and efficient compared with

other sparse methods in order to run on a lightweight embedded platform. It achieves high accuracy and reduced drift in under 20 ms on an embedded platform; however, it achieves that by mapping a small number of points and concentrating on visual odometry. These approaches

map a sparse selection of features, reducing the density of the reconstruction to a small set of 3D points, and often cannot provide a large-scale, coherent map during exploration.

One approach that has been explored is combining efficient visual or visual-inertial odometry

with the option of offloading computation to a base station and reconstructing a dense or global

map there, as for example in Sturm et al. [13] and others [14, 58]. This strategy can address

some applications but comes with increased power consumption for the wireless communication,

as well as increased latency. It also comes with a reduced area of operation, and very high

bandwidth requirements if it needs to operate in real-time, while the odometry cannot benefit

from the denser map to enhance its interaction with the environment. A complete, higher-density SLAM solution onboard an embedded platform stands to overcome all of these

disadvantages and is necessary for the emerging applications discussed in the introduction.

Dense SLAM has been advancing rapidly in the last decade. There have been examples with

expanding capabilities, such as deformable objects and surfaces for full environment recon-

struction [37] and explicit object modelling. However, its requirements in sensors, energy and

computation are infeasible for an embedded platform. Works on semi-dense methods, such as LSD-SLAM, are more applicable to the embedded space thanks to lower computational complexity and reliance on simpler RGB or greyscale passive camera sensors. LSD-SLAM [17], for

example, provides a tracking accuracy comparable to other state-of-the-art sparse methods but generates a higher-density map that provides significantly more information about the environment. As such, it was selected as the target for the custom accelerator presented in this

work.
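The core computational pattern that makes such direct methods expensive, and that the accelerators in this thesis target, can be reduced to a sketch: minimising a photometric error between a reference image and the current image over a warp. The example below shrinks the problem to a single row and a pure integer shift; the real algorithm warps semi-dense pixels through an SE(3) pose using per-pixel inverse depth and solves with Gauss-Newton rather than exhaustive search, so this is only an analogue of the residual structure, not LSD-SLAM's tracker.

```cpp
#include <cassert>
#include <vector>

// Minimal 1-D analogue of direct image alignment: find the shift that
// minimises the summed squared photometric residual between a reference
// row and the current row.
int align(const std::vector<int>& ref, const std::vector<int>& cur,
          int max_shift) {
    int best_shift = 0;
    long best_cost = -1;
    for (int s = -max_shift; s <= max_shift; ++s) {
        long cost = 0;
        for (int x = 0; x < static_cast<int>(ref.size()); ++x) {
            const int xc = x + s;  // warped coordinate in the current row
            if (xc < 0 || xc >= static_cast<int>(cur.size())) continue;
            const long r = ref[x] - cur[xc];  // photometric residual
            cost += r * r;
        }
        if (best_cost < 0 || cost < best_cost) {
            best_cost = cost;
            best_shift = s;
        }
    }
    return best_shift;
}
```

Even in this toy form, the workload is a large number of independent per-pixel residuals accumulated into a single cost, which is the data-parallel, high-operations-per-pixel pattern that motivates a custom hardware pipeline.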

Executed on a general purpose CPU, LSD-SLAM requires a quad-core high-end x86 CPU to run

in real time. In the embedded space, attempts to run SLAM such as this have had to reduce functionality to simple visual odometry and run at reduced resolution. Since the runtime scales almost linearly with the number of points to process, by reducing the number of pixels by a factor of 4 and optimising the algorithm’s features to run on a mobile device, Schoeps et al. achieved a performance of approximately 20 frames/sec for mapping at a resolution of 320x240 and tracking at 160x120 [59], trading off accuracy and richness for lower complexity. However,

code optimisation is not enough to allow a mobile device to handle the original resolution,

density and other features without a large drop in performance below real-time specifications.
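The scaling referred to above is easy to make concrete. Relative to the 640x480 input used by the original LSD-SLAM, the mobile port processes 4x fewer pixels for mapping and 16x fewer for tracking, and with a roughly linear runtime in the number of processed pixels this is what makes its frame rates attainable on a mobile CPU.

```cpp
#include <cassert>

// Pixel counts behind the runtime scaling discussed above: runtime is
// roughly linear in the number of processed pixels/points.
constexpr int full_res = 640 * 480;  // original LSD-SLAM input resolution
constexpr int map_res  = 320 * 240;  // mobile-port mapping resolution [59]
constexpr int trk_res  = 160 * 120;  // mobile-port tracking resolution [59]
```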

Recently, there have been attempts in designing custom hardware for SLAM in the embedded

space. Suleiman et al. [60] demonstrated a custom ASIC design for visual-inertial odometry

targeting nano-drones. It belongs in the category of sparse odometry and achieves high performance together with power efficiency, realised as a chip fabricated on a 65nm CMOS process.

It enables environment awareness for very lightweight robots, but because of its specialisation it

only performs the version of sparse visual-inertial odometry it was designed for and cannot be

extended to semi-dense or dense SLAM. This is a typical example of an optimised ASIC implementation of an algorithm, which trades flexibility and cost to achieve the highest performance

and power efficiency for a specific task.

Moving from ASICs to reconfigurable hardware, work on FPGAs in the past has been limited

in scope to accelerating selected computation kernels for sparse SLAM as pre- or co-processors

such as [61,62]. For example, FPGAs have been used to implement feature detectors based on

SURF [63], SIFT [64] and many others, as well as stereo disparity estimation such as [50, 65].

A very interesting example of both, where the FPGA acts as a coprocessor, is [66], which considers FPGAs for autonomous planetary rover navigation. In this line of thought,

Honegger et al. [67] proposed a custom board combining an FPGA and a mobile CPU for

robotic vision, evaluated by offloading a popular type of disparity estimation algorithm (SGM

stereo) to the FPGA. One characteristic of this architecture is a one-way link, with the FPGA placed between the camera and off-chip memory. This can cover some use cases, for example the

pre-computation of feature extraction, but is limited in the amount of processing it can do.

In contrast, we target a more flexible system architecture that can allow more fine-grained

cooperation between hardware and software.


These works also cannot directly lead to significantly low-power, high-density SLAM, since they deal only with pre-processing steps, leaving a large percentage of the computation to the CPU that has to receive this data. This means their resulting performance-per-watt will remain close to that of the general-purpose CPU if a large percentage of the computation time is spent there.

The work presented in this thesis targets more complete functionality from the accelerators’ side

in order to achieve very high throughput levels with low latency between frames, at a much

lower power consumption, targeting smaller and lighter platforms. Moreover, our approach avoids the pitfall of requiring hardware-based integration and knowledge, instead exposing the

accelerator as a C++ function with an integrated driver underneath, in a full Linux-based

operating system. This makes it more approachable to the computer vision community by

reducing the difficulty and expertise required in integrating it with existing software-based

frameworks.

Before our work, to the best of our knowledge, there was no hardware implementation of a

complete SLAM algorithm, and, looking at visual-inertial odometry, the only complete hardware implementation on a chip is Navion [60], mentioned above. In contrast, our work targets a more complete implementation of the semi-dense tracking and mapping tasks. Some sub-tasks are

still allocated to the mobile CPU where that choice is more efficient, however the accelerators

presented in this thesis are an almost complete implementation of state-of-the-art semi-dense

tracking and semi-dense mapping and operate with a much more fine-grained and efficient

cooperation with the CPU.

Evaluation of different works in the field of SLAM

One major challenge in comparing these works with ours is the variation in different evaluation

methods and in the information reported in publications. Some software-based works include results on the newer standardised benchmarks of the SLAM space, such as the TUM datasets [68], the ICL-

NUIM dataset from Imperial College London and the National University of Ireland Maynooth

[46] and the EuRoC MAV dataset [69]. The newest open versions of all of these datasets can be found on the respective websites of the academic institutions mentioned in the papers referenced here.


These datasets were a major step forward in comparing different SLAM methods, offering a hard

ground truth and multiple measures of the trajectory error of an algorithm over the lifetime of a

dataset. However, different papers might use only a small subset of these, or initialise their work

with different conditions, in an approach that can distort the final results. Moreover, looking at

the performance of different methods, papers published in computer vision and robotics venues

often neglect to provide a detailed or rigorous evaluation of the computational performance of

their methods. So while some works will clearly state average and even performance over time,

and describe their platforms in terms of exact CPU model and amount of memory such as

ORB-SLAM [41] and Whelan et al. [36], which can be used to extract an accurate estimate of

performance requirements, others are more vague. Also, information on optimisation levels and

the optimisation effort from the designer are often missing and can vary significantly between

implementations.

Evaluation of the work presented in this thesis

As such, the metrics with which we compare this work to related works and evaluate it are as

follows. First, since most low-power solutions focus on sparse SLAM or visual odometry, one

level of comparison is in terms of features and tracking/map density. The work presented in

this thesis aims to achieve the same power consumption and tracking latency as these works while offering the full range of features of a state-of-the-art SLAM and a richer semi-dense

map reconstruction. As we have not modified the algorithm this work was based on, but

chose to remain faithful to the original implementation, the quality and accuracy of SLAM

utilizing the presented architectures is the same as the base algorithm, LSD-SLAM [17]. In the

different published works and presented architectures, care was taken to test and establish this equivalency to the software solution in terms of functionality and results.

Secondly, an important aspect of this work is its power requirements and performance-per-watt.

Using information provided in different papers we can attempt to create a map of the typical

power consumption of different platforms used. For instance, by knowing that a laptop version

of an Intel i7-4700MQ CPU was used for all experiments, a typical expected power consumption


should be in the range of 36-47 W under a high multi-threaded load2, such as the one LSD-SLAM

would produce. This method can introduce errors of a few percentage points but should give

a consistent estimate of the order of magnitude of power requirements for SLAM algorithms

published in the last few years. For papers that focus more on power consumption and report

power measurements the comparison is with the published power figures directly. In the next

two chapters, for the presented architectures power was estimated from the synthesis and place-

and-route tools, and in Chapter 5 power was measured directly at the wall for different test

platforms and our implemented accelerators while executing LSD-SLAM to supplement these

estimates with real-world measurements.
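One simple way such an estimate can be formed, shown below, is to interpolate between an assumed idle power and the TDP with utilisation. This linear model and the 10 W idle figure are purely illustrative assumptions for this sketch, not the exact procedure used in the thesis; only the 47 W TDP corresponds to the manufacturer's specification for the i7-4700MQ [70].

```cpp
#include <cassert>

// Illustrative CPU power model for papers that report only the platform:
// interpolate between an assumed idle power and the TDP with utilisation.
// The 10 W idle figure is a placeholder assumption; the 47 W TDP matches
// the manufacturer's specification for the i7-4700MQ [70].
double estimate_power_w(double idle_w, double tdp_w, double utilisation) {
    return idle_w + (tdp_w - idle_w) * utilisation;
}
```

For example, estimate_power_w(10.0, 47.0, 0.8) gives roughly 40 W, the right order of magnitude for the multi-threaded loads discussed above; the few-percent error this introduces is consistent with the accuracy claimed for the comparison.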

Finally, the achieved performance of the presented work is evaluated in every chapter. This

is measured in terms of the latency to track a frame or update the Keyframe’s map, or the

inverse of that latency as frames per second for the tracking and mapping tasks’ throughput.

Since performance varies significantly with the features, map density and resolution used by different algorithms, this part of the evaluation is a direct comparison between LSD-SLAM as pure software and the proposed hardware-accelerated version in each chapter. For Chapters 3 and 4, the ARM Cortex-A9 dual-core CPU in Xilinx Zynq SoCs is used to provide a baseline for the

software version. Finally, in Chapter 5, a high-end desktop CPU (Intel i7-4770) and a high-end

quad-core mobile CPU (ARM Cortex-A57) are used for performance comparisons. The first of

the two establishes the performance achievable from an optimised implementation on a modern

CPU if power is not a constraint, while the second showcases the performance achievable from

an optimised implementation on a modern high-end mobile CPU used in a high-performance

mobile system where performance-per-watt is important. Comparing to the literature, Engel et

al. [17] report performance in the region of 30 frames/sec. for the same resolution as our tests,

using a high-end quad-core laptop CPU that, as discussed earlier in this section, has a typical

power consumption of 35-45 W.

The above experiments were conducted with a selection of datasets that represent two different

real-world environments: the first is a small but cluttered room indoors, while the second

2 For this figure we utilise the manufacturer’s thermal design power and specifications [70], with an estimated load of 60-80% across at least three out of four cores.


is an outdoor area with loop closures (revisiting the same space multiple times), with

a combination of close objects, medium-distance buildings and far-away features such as trees

and clouds. These were provided by Technical University Munich (TUM), by the authors of

LSD-SLAM on TUM’s website3. These datasets were used for their quality and the variety

of environments they contain to provide measurements for the performance of the proposed

hardware compared to a software-only version, and to ensure equivalency with the software

implementation.

As such, the comparison to related work is made in two dimensions. The first is to compare the overall solution our work provides, in terms of map density, features, and the accuracy of a state-of-the-art method, with other embedded and desktop solutions. In this dimension

we execute a more complex, higher-density SLAM algorithm compared to the state of the art

in embedded SLAM, but at the power requirements of mobile, sparse visual odometry. The

second dimension is the performance-per-watt improvement of running LSD-SLAM compared

to current general purpose platforms. On this dimension, we achieve a rate of processing 5 times

higher than that of a high-end mobile CPU (quad-core ARM Cortex-A57) with a

comparable and lower power consumption. Meanwhile, the proposed accelerator matches the

performance of a high-end desktop x86 CPU (Intel i7-4770 @ 3.77GHz) but with an order of

magnitude less power consumption, for an overall improvement of an order of magnitude in

performance per watt compared to general-purpose CPUs for advanced semi-dense SLAM.

An overview of the field, giving a comparison across these two dimensions, and combining

hardware and software implementations across different SLAM categories, is given in Table 2.1.

As will be discussed in Chapter 5, where a more concise version of such a table is used, it is not

meant to be exhaustive or rank the works. It is instead compiled to showcase the breadth of the

field, and the features and completeness that different algorithms offer towards environment-aware applications. Our work fits between the two following main approaches, positioned to

bridge the large gap in SLAM capabilities between research in algorithms and applications in

embedded systems.

3 https://vision.in.tum.de/research/vslam/lsdslam


On one side there are dense examples with advanced features, such as tracking over deformable

surfaces [37] and even using a persistent 3D graph map of arbitrary reconstructed objects [38]. These approaches, however, utilise the latest and highest-performance CPUs and GPUs for acceleration, with power requirements in the hundreds of watts, and usually necessitate

right. On the other side, works targeting lightweight micro aerial vehicles (MAVs), medium-

sized robotics and augmented/virtual reality have so far focused purely on sparse features used

for accurate tracking, with very limited large-scale capabilities and a sparse map as a result.

With the architectures presented in this thesis, by significantly raising the performance-per-watt

capabilities of a platform for SLAM utilizing custom hardware, we can provide semi-dense, large-

scale mapping capabilities with only a passive monocular camera, and a power consumption in the single digits of watts when realised on an off-the-shelf FPGA-SoC. As such, Table 2.1 showcases the

positioning of works in SLAM and visual odometry, targeting pure software implementations, embedded platforms and hardware acceleration, across the two dimensions of performance-per-watt and capabilities/density, with the final line being our work as presented in Chapter 5 combined with the work from Chapter 4.


Work                    | Type          | Hardware Plat.    | Density    | Large-scale | Monocular | Typical Power   | Performance
Algorithmic
ORB-SLAM [41]           | SLAM          | Laptop CPU        | Sparse     | X           | X         | approx. 38-47 W | 30 fps track., 2.2 fps map.
LSD-SLAM [17]           | SLAM          | Laptop CPU        | Semi-dense | X           | X         | app. 40-50 W    | 30 fps
Whelan et al. [36]      | SLAM          | GPU Accelerated   | Dense      | X           |           | app. 230-360 W  | 30+ fps
Fusion++ [38]           | SLAM          | GPU Acceleration  | Dense      |             |           | app. 250-400 W  | 4-8 fps
Embedded
Leutenegger et al. [45] | SLAM          | Laptop CPU        | Sparse     | X           | X         | app. 30-50 W    | 20 fps (scal.)
SVO [57]                | Odometry      | Laptop            | Sparse     |             | X         | app. 30-40 W    | 166 fps (acc.)
SVO [57]                | Odometry      | Jetson-TX1        | Sparse     |             | X         | app. 10-15 W    | 55 fps (fast)
Barry et al. [71]       | Obstacle Det. | 2x ODROID-U2      | Sparse     |             |           | app. 20 W       | 120 fps
Hardware
Weberruss et al. [61]   | Feature Extr. | FPGA              | N/A        | N/A         | N/A       | 5.3 W           | 2 ms feat. extr.
Navion [60]             | Odometry      | ASIC (65nm CMOS)  | Sparse     |             | X         | 24 mW           | 171 fps
Oleynikova et al. [72]  | Obstacle Det. | Mobile CPU + FPGA | No map     |             |           | app. 14-20 W    | 60 fps
This work               | SLAM          | FPGA-SoC          | Semi-dense | X           | X         | 6-7 W           | 60-100 fps

Table 2.1: State-of-the-art SLAM examples. Compiled with a focus on the features and characteristics of different solutions, to demonstrate the breadth of the field in terms of features and power, with typical or, where available, reported power requirements. Comparisons are with camera resolutions in the same region of MPixels.


Chapter 3

Accelerating semi-dense SLAM

Simultaneous localisation and mapping (SLAM) is central to many emerging applications such as autonomous robotics and augmented reality. These require an accurate and information-rich reconstruction of the environment, which is not provided by the current state-of-the-art sparse, feature-based methods in the embedded space. SLAM needs to be performed in real time, with a latency low enough to ensure small translation and rotation between successive camera frames. At the same time, dense SLAM that can provide a high level of reconstruction quality and completeness comes with high computational and power requirements, while the available platforms in the embedded space often come with significant power and weight constraints. Towards overcoming this challenge, the first part of this chapter discusses the characteristics and computational patterns of this type of application, focusing on a state-of-the-art semi-dense direct SLAM algorithm: a novel category providing a much denser map than traditional feature-based methods, but in a more efficient formulation than other state-of-the-art dense SLAM methods. The second part discusses our work on designing a custom hardware accelerator, based on an FPGA-SoC, to provide semi-dense SLAM functionality at a much lower power level.


Finally, an accelerator for semi-dense, direct tracking is evaluated. The achieved acceleration is discussed, together with the bottlenecks and challenges that were encountered.

This chapter is based on a conference paper co-authored by Christos-Savvas

Bouganis [Boikos Konstantinos et al., FPL, IEEE, 2016 [18]]

3.1 Motivation

Semi-dense and dense SLAM is crucial in many emerging applications of robotics and embedded vision. However, state-of-the-art SLAM comes with high computational requirements: modern sparse and semi-dense methods require high-performance CPUs, and dense real-time SLAM requires GPU acceleration. At the same time, UAVs and other robots impose tight constraints on the power and weight of the electronics on board. Some of them also impose strict latency constraints, while other applications that require some form of SLAM, such as augmented reality, have even stricter power and weight constraints due to being worn on the head and being passively cooled [73].

These requirements, which are not met by software optimisation or advances in general-purpose silicon, can be met by designing new hardware platforms specialised for this task. In the last decade, major FPGA companies introduced a new type of embedded System-on-Chip (SoC), one that combines an embedded low-power CPU with programmable logic on the same chip. Utilizing the programmable logic, custom specialised hardware accelerators can be designed to act as coprocessors alongside the mobile CPU, combining the best of both worlds. This type of heterogeneous system, while having low enough power requirements to fit low-power platforms, stands to substantially improve performance compared to general-purpose embedded CPUs/GPUs and to bring more advanced and computationally demanding algorithms to the embedded space. As we have discussed in the first two chapters, custom hardware in the form of an FPGA or ASIC can provide a high level of absolute performance but, more importantly, stands to significantly improve performance per watt, meaning lighter and more


power-constrained platforms can be targeted.

SLAM is mainly composed of two strongly interdependent tasks, tracking and mapping. They share similar low-latency requirements and are both computationally intensive. As will be demonstrated by the profiling results in this chapter, together they account for more than 80% of the computation time of real-time SLAM. They are also the two tasks that need to happen in real time during exploration, as discussed in the background chapter, to keep track of a moving robot or platform, while global map maintenance can be allocated to background tasks or even done offline. As such, they both need to be accelerated to achieve a high-performance, low-power solution in the embedded space.

In this thesis, both tasks are addressed by complementary accelerators. However, the task

of tracking was targeted first, for the following reasons. Fast and accurate tracking is crucial

in SLAM and must be established before accurate mapping can be introduced. Firstly, it is

the entry point of all new information. If tracking skips a frame, the information it contained

cannot be used at any other point in the algorithm. Tracking has to be fast enough, with a

performance of at least 30 frames per second depending on the speed of the moving robots, in

order to keep up with the camera movement, as is demonstrated for example in [39].

Direct tracking is also based on an assumption that only a small amount of translation has

occurred between frames [1], making a high rate of processing even more crucial for fast-moving

cameras. Faster-moving platforms, such as fixed-wing aircraft, necessitate even faster sustained tracking rates. In the work of Andrew Barry [6], for example, stereo-based localisation and mapping for obstacle avoidance on a fixed-wing drone processes frames at a rate of 120 frames per second. Therefore, in Chapters 3 and 4 we focus first on designing an accelerator to provide

high-performance low-power tracking. In Chapter 5, the task of mapping is also addressed,

with a specialised high-throughput accelerator for semi-dense depth map estimation.

This chapter investigates the performance and acceleration opportunities for a complex, optimised, state-of-the-art semi-dense SLAM algorithm, based on LSD-SLAM by Engel et al. [17], with custom hardware accelerators based on an FPGA SoC. This algorithm was chosen since, as has been discussed in the Background chapter, it offers a dense map, more suitable for


applications in robotics, and its direct tracking formulation generalises better to a variety of indoor and outdoor environments. Furthermore, the fact that it only requires an off-the-shelf monocular camera sensor means that the weight and power requirements are significantly reduced before the processing is introduced, as discussed in Chapter 2.

We begin the chapter by characterising the performance and computation patterns of the open-source version of LSD-SLAM, a hand-optimised, multi-threaded, vectorised software implementation. We then present a custom, fully-pipelined architecture targeting the three key functions present in the tracking task. Finally, we analyse the performance and performance per watt of the designed accelerators, and the bottlenecks at the accelerator and system levels.

The key contributions presented in this chapter are the following:

• First, the characterisation and analysis of a state-of-the-art semi-dense SLAM algorithm.

• Second, an architecture based on an FPGA system-on-chip that formed the first accelerator design, to the best of our knowledge, to target state-of-the-art semi-dense SLAM.

• Third, the identification of opportunities and bottlenecks of heterogeneous FPGA SoCs for such algorithms.

3.2 Tracking Algorithm in LSD-SLAM

LSD-SLAM is a semi-dense, large-scale direct SLAM algorithm with a monocular camera input.

Its tracking method is direct whole-image alignment with a photometric residual to recover

an estimate for the 6-degree-of-freedom pose of the camera. It subsequently uses that pose

estimate to update the current depth map estimate. The mapping task uses the recently tracked

frame with its recovered pose estimate to perform frame-to-frame epipolar stereo. For each

successful observation across the two views a depth estimate is recovered and a filtering depth

update is performed for already observed points to improve the current estimate. Mapping also


keeps track of the uncertainty of the depth estimate for each mapped pixel, treating the depth

observation as a Gaussian probability distribution. This depth and uncertainty information is

stored in a data structure called a Keyframe, which contains, bundled with the pixel values of

a frame, depth and depth variance information as well. Finally, the most recent copy of that

depth information is the input of the tracking task to track the next incoming frame, closing

the information loop of simultaneous localisation and mapping.
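The filtering depth update mentioned above can be illustrated with the standard product of two Gaussian hypotheses. This is a hedged sketch of the general idea, using the textbook fusion formula, and is not necessarily the exact update rule used in the thesis implementation:

```cpp
#include <cassert>
#include <cmath>

// One depth hypothesis: mean and variance of a (inverse-)depth estimate.
struct DepthHypothesis { double mu; double var; };

// Fuse an existing estimate with a new stereo observation. The standard
// Gaussian product yields a combined estimate with lower variance than
// either input, which is why repeated observations refine the map.
DepthHypothesis fuse(DepthHypothesis a, DepthHypothesis b) {
    double var = (a.var * b.var) / (a.var + b.var);
    double mu  = (b.var * a.mu + a.var * b.mu) / (a.var + b.var);
    return {mu, var};
}
```

With equal variances the fused mean is the midpoint and the variance halves, matching the intuition that two equally trusted observations average out.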

The principles behind this process were presented in Chapter 2. For the rest of the thesis we

will focus on the elements of the algorithm and its software implementation which are relevant

to its performance and acceleration. This chapter will focus more on elements of the tracking

task important to the accelerator architecture presented in Section 3.4. Information regarding

the algorithm and implementation of the mapping task will be presented in a more complete

form in Chapter 5, with only some high-level information repeated here as necessary to discuss

the architecture, profiling results and the performance of the system.

3.2.1 Tracking in LSD-SLAM

In Section 2.4 we introduced an algorithmic overview of LSD-SLAM. This section will discuss,

from a high-level perspective, the computations involved in the software version of LSD-SLAM

and introduce a pseudocode representation for the main calculations and data involved.

The core of the tracking algorithm is the iterative optimisation process presented in Chapter 2.

The pose is recovered by aligning the frame that is tracked with the current Keyframe and

attempting to minimise the variance-normalised squared error of the intensity differences. Every

valid 3D point of the map is part of the input. Utilizing the depth, variance and intensity values

of the map points, they are reprojected to the plane of the current camera frame to calculate

the photometric error and gradient of the reprojection.

The tracking task is composed of three main sub-functions, a residual and gradient calculation

function, a weight and normalising factor calculation function and a function that generates

the linear system for optimisation. The first one will be referred to as residual calculation, the


second one as weight calculation and the third one as linear system generation throughout the

thesis. The first two essentially perform computations on the same data points, and are responsible for the reprojection and error calculation mentioned above. They are separate functions in the available open-source implementation of LSD-SLAM but are always called sequentially, so in designing an accelerator they were treated as one task and combined in hardware into one set of operations. The last one is called only for a subset of the iterative optimisation process

described in Chapter 2.

The control flow implementing this iterative Levenberg-Marquardt [48] optimisation process is as follows. The error and gradients are calculated first, followed by the weighting factors for the weighted least squares formulation. Optimising for a solution to the pose estimate makes use of a second-order approximation of the error function. From that, a damped linear system (before the system is solved, a damping factor is added in the form of a normalising constant added to the main diagonal) is constructed in the final software function. If its solution fails to decrease the error (which is verified by a re-run of the first two functions, which calculate the total weighted error), the damping parameter is changed and the error calculation is executed again, without the task of linear system generation. If the error is decreased, an update to the pose estimate is calculated and a new optimisation step is generated from the new position. This process is summarised in Fig. 3.1.
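The accept/reject control flow described above can be illustrated on a toy one-dimensional problem, where scalar H and b play the roles of JᵀWJ and JᵀWr and λ is the damping factor. This is a sketch of the optimisation pattern only, not the thesis or accelerator code; the damping schedule constants are illustrative:

```cpp
#include <cassert>
#include <cmath>

// Levenberg-Marquardt-style loop minimising e(x) = (x - 3)^2 with
// residual r = x - 3 and Jacobian J = dr/dx = 1.
double lm_minimise(double x) {
    auto err = [](double v) { double r = v - 3.0; return r * r; };
    double lambda = 1.0;
    double last_error = err(x);
    for (int iter = 0; iter < 50; ++iter) {
        double r = x - 3.0;   // residual of the toy problem
        double J = 1.0;       // Jacobian
        double H = J * J;     // Gauss-Newton approximation of the Hessian
        double b = J * r;
        bool accepted = false;
        while (!accepted) {
            double delta = -b / (H + lambda);  // solve the damped linear system
            double candidate = x + delta;
            double new_error = err(candidate);
            if (new_error < last_error) {      // error decreased: accept the step
                x = candidate;
                last_error = new_error;
                lambda *= 0.5;                 // move towards pure Gauss-Newton
                accepted = true;
            } else {                           // rejected: increase damping, retry
                lambda *= 4.0;                 // (re-damps the SAME linear system)
                if (lambda > 1e12) return x;
            }
        }
        if (last_error < 1e-14) break;
    }
    return x;
}
```

Note that a rejected step only changes λ and re-solves; the linear system itself is not regenerated, mirroring the behaviour described in the text.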

The functionality surrounding the main computations only takes a few percentage points of

the computation time on a mobile CPU but would still require significant resources, especially

targeting a smaller FPGA, to be mapped to hardware. Furthermore, the computation in the

residual and weight calculation functions happens more often than in the linear system generation function, and the latter is only called after the other two finish and the final error is known. This led to the decision, for the work presented in this chapter and detailed in Section 3.4, to design a separate pipeline for each of these two main operations and to separate them from the rest of the control. Thus, the residual and weight calculations are computed together in one pipeline, and the linear system generation happens in another. The rest of the operations

were left in software due to their low relative computational cost, identified with timing and

profiling results presented in the next section, and to avoid complex control being unnecessarily


implemented in an FPGA fabric.

[Figure 3.1 flowchart: project pixels using the depth and pose estimate → interpolate intensity and gradient for the reprojection error → calculate weights (W), residual and Jacobian vectors → using a Taylor expansion approximation, Hδ + b = 0, δ = −H⁻¹b, or δ = −(JᵀWJ + λI)⁻¹JᵀWr(ξₙ) → solve the linear system for a new pose estimate → check convergence and adapt λ; if the error did not decrease, apply the new λ and repeat.]

Figure 3.1: Control flow of the tracking task. The three main sub-functions are described inside the dashed lines. The rest of the computation and control described in the figure happens outside these functions.

The process is repeated for different resolutions, from small to large, for the pyramid processing we have discussed in Chapter 2. Every step of the optimisation process, across iterations and pyramid levels, is dependent on the output of the previous one. This led to the decision to parametrise the accelerators in a resolution-agnostic way. However, it still meant that the computation for different pyramid levels could not be overlapped, but had to be performed sequentially. The control of this process presents a large number of possible paths but is not computationally demanding, so the mobile CPU was chosen as a better fit to handle the higher-level decisions.
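The pyramid processing referred to above can be sketched as repeated 2x2 block averaging, each level halving the resolution. The exact downsampling filter used by LSD-SLAM is an assumption here; this is an illustrative sketch, not the thesis implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build an image pyramid: level 0 is the input image (row-major, w x h),
// each following level averages 2x2 blocks of the previous one. Tracking
// then optimises from the smallest level to the largest, each level seeded
// with the pose estimate of the previous one.
std::vector<std::vector<std::uint8_t>> build_pyramid(
        std::vector<std::uint8_t> img, int w, int h, int levels) {
    std::vector<std::vector<std::uint8_t>> pyr{img};
    for (int l = 1; l < levels; ++l) {
        int nw = w / 2, nh = h / 2;
        std::vector<std::uint8_t> next(nw * nh);
        const std::vector<std::uint8_t>& prev = pyr.back();
        for (int y = 0; y < nh; ++y)
            for (int x = 0; x < nw; ++x) {
                int sum = prev[(2 * y) * w + 2 * x] +
                          prev[(2 * y) * w + 2 * x + 1] +
                          prev[(2 * y + 1) * w + 2 * x] +
                          prev[(2 * y + 1) * w + 2 * x + 1];
                next[y * nw + x] = static_cast<std::uint8_t>(sum / 4);
            }
        pyr.push_back(next);
        w = nw; h = nh;
    }
    return pyr;
}
```

The serial dependency noted in the text is visible here: level l cannot be consumed by the optimiser before level l+1 (the coarser one) has finished its iterations.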

3.2.2 The tracking algorithm

In the interest of brevity and clarity of presentation not all variables and calculations are

shown. Rather, the focus is on the details relevant to the research and discussion presented in

this thesis. All threshold variables are also potentially tunable parameters that can trade-off

quality and performance. As the focus of this research is on hardware design, the tuning of


these parameters to different robotic systems and environments is outside the scope of this

thesis, and orthogonal to our work. The variables and data structures used below are the same as those defined in Section 2.4. The figure included in that chapter is copied here for reference,

as Fig. 3.2.

[Figure 3.2 flowchart: a new camera frame enters Tracking (get tracking reference → track Frame → calculate Keyframe closeness score); the new frame and pose pass to Mapping, which either updates the current Keyframe or, if the distance score is over a threshold, retires the previous Keyframe into the Keyframe graph and creates a new one, generating a new tracking reference; secondary background functions include the GUI, loop closure detection and graph optimisation, operating on the unmapped tracked frames queue and the Keyframe graph.]

Figure 3.2: SLAM algorithmic overview. The main tasks are inside the two dashed rectangles. The main data structures are indicated with orange coloured boxes, while the light blue boxes represent the main tasks necessary to perform tracking and mapping. With grey we indicate some of the background tasks involved in full SLAM.

This section focuses on the complexity of the track Frame function, and its dependency on the number of valid Keypoints, as well as the current pyramid level and the system's maximum resolution

for tracking. For the theoretical background and functionality of the tracking task one can refer

to Section 2.5. This section will only focus on the computations involved in implementing that

functionality.

The first sub-function, the calculate residuals function, calculates the values in equation 3.1,

and also calculates other intermediate values, such as the intensity derivatives dx, dy, for the

eventual construction of the sum of weighted, squared residuals and the linear system to solve,

both of which happen in separate functions, and buffers them in memory in a result buffers

data structure.


r_p(p, ξ_ji) := I_i(p) − I_j(ω(p, D_i(p), ξ_ji))    (3.1)
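The per-point residual of equation 3.1 can be sketched as follows. The bilinear sub-pixel lookup is a standard technique assumed here for the interpolation step, and the flat row-major image layout is an assumption for illustration; this is not the thesis implementation:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Bilinear interpolation of a row-major image (width w) at a sub-pixel
// position (u, v). Callers must keep (u, v) inside the interior so that
// the four neighbouring pixels exist.
double bilinear(const std::vector<double>& img, int w, double u, double v) {
    int x = static_cast<int>(u), y = static_cast<int>(v);
    double fx = u - x, fy = v - y;
    return img[y * w + x]         * (1 - fx) * (1 - fy) +
           img[y * w + x + 1]     * fx       * (1 - fy) +
           img[(y + 1) * w + x]   * (1 - fx) * fy +
           img[(y + 1) * w + x + 1] * fx     * fy;
}

// r_p = I_i(p) - I_j(warp(p)): reference intensity minus the interpolated
// intensity at the reprojected (sub-pixel) location in the new frame.
double photometric_residual(double ref_intensity,
                            const std::vector<double>& new_frame, int w,
                            double u, double v) {
    return ref_intensity - bilinear(new_frame, w, u, v);
}
```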

The calculate residuals function is presented in pseudocode in algorithm 2. Its main part is the calculation responsible for the reprojection of Keypoints to the new Frame's projective plane, leading to the residuals used in the sum of weighted residuals in equation 3.2, used in the

optimisation process. It is always followed, after a check for divergence (early error estimate

significantly high), by the weight and residual calculation function. That function, presented

here as algorithm 3, uses the intermediate results of the residual calculation function to calculate

the overall weighting factor for each residual, integrating it in the result buffers data structure,

and calculates the final normalised error in equation 3.2.

E_p(ξ_ji) = Σ_{p ∈ Ω_{D_i}} ‖ w(p, ξ_ji) · r_p²(p, ξ_ji) / σ²_{r_p}(p, ξ_ji) ‖_δ    (3.2)

The total error calculation is repeated in the Calculate Linear System function once an increment is accepted. This last function, in algorithm 4, calculates the linear system whose solution generates the update in equation 3.3, without including the normalising factor λI, which is added outside. J in equation 3.3 is the stacked matrix of all the pixel-wise Jacobians J_i, and W is the diagonal matrix of weights, with W_i,i = w(residual_i).

δξ^(n) = −(JᵀWJ + λI)⁻¹ JᵀW r(ξ^(n))    (3.3)

As we can see in algorithm 1, the linear system calculation is not repeated for every optimisation step. Rather, it happens for the first step, and then the increment is temporarily applied to the pose. From that new pose estimate, the residuals and the error are calculated again by algorithms 2 and 3. If that pose decreases the error, it is accepted, and a new linear system is generated to calculate the next increment from that position in the optimisation space. If it increases the error, it is rejected and a new normalisation factor λ is applied to the previously generated linear system, essentially transitioning between Gauss-Newton optimisation

viously generated linear system, essentially transitioning between Gauss-Newton optimisation


and gradient descent.

Algorithm 1 Track Frame Function

procedure TrackFrame(Keyframe, last pose, new Frame, K)   # K is the intrinsic camera matrix
    pose estimate = last pose
    reference = Subsample and copy valid Keypoints(Keyframe)   # Generate tracking reference
    for pyramid level = max level to min level do
        # Intermediate results, such as residuals and gradients, are buffered in this function
        (residual, result buffers) = Calculate residuals(new Frame, reference, pose estimate, K)
        Check for divergence(residual)
        (error, weights) = Calculate weights and residual(new Frame, reference, pose estimate, result buffers)
        λ = 1
        while iteration < max iterations AND converged = False do
            (A, b) = Calculate Linear System(result buffers)   # First-order Taylor approx.
            while error not decreased do
                A = A + λI
                increment = solve(A, b)
                new pose = pose estimate ∗ increment   # Exponential mapping
                Calculate residuals(new Frame, reference, new pose, K)
                Check for divergence(residual)
                Calculate weights and residual(new Frame, reference, new pose, result buffers)
                if λ ≤ 0.2 then
                    λ = 0
                else
                    λ = λ ∗ 1/2   # Tunable
                if error < last error then
                    pose estimate = new pose
                    error not decreased = False
                    if error/last error > conv threshold then
                        converged = True
                else
                    if increment < increment threshold then
                        converged = True


Algorithm 2 Calculate Residuals Function

procedure Calculate Residuals(new Frame, reference, pose estimate, K)
    for all Keypoints in reference do   # Keypoints with a valid observation
        Reproject Keypoint to new Frame(reference[Keypoint])   # Use current pose estimate
        Interpolate(I, dx, dy)   # Subpixel interpolation for intensity and gradients
        Calculate Intensity residuals()
        Calculate sum of squared values to buffer()
        Calculate fitness score()
        residuals[Keypoint] = residuals[Keypoint] ∗ fitness
        Result buffer[Keypoint].add(residuals, squares, I, dx, dy)
        if fitness score < fitness threshold then
            Result buffer[Keypoint].skip = True
    # First warning of divergence comes from the sum of squared residuals
    residual = Sum(squared residuals) / number of good points
    return residual, number of good points, Result buffer

Algorithm 3 Calculate Weights and Residual Function

procedure Calculate Weights and Residual(new Frame, reference, Result buffer)
    points = 0
    for all results in Result buffer do
        points = points + 1
        error = Result buffer[Keypoint].residual
        variance = reference[Keypoint].depth variance
        Calculate weight(variance, error)
        Add pixel noise estimate to weight(weight)
        weighted error = Huber(error ∗ weight)   # Huber norm
        summed error = summed error + weighted error ∗ weighted error
        Result buffer[Keypoint].weight = weight
    return Result buffer, summed error/points
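The weighting idea in algorithm 3 can be illustrated with a variance-normalised, Huber-style robust weight that down-weights large (likely outlier) residuals. The constant and the exact formula here are illustrative assumptions; LSD-SLAM's actual weighting differs in detail:

```cpp
#include <cassert>
#include <cmath>

// Robust weight for one residual: normalise by the standard deviation of
// the depth-induced uncertainty, then apply a Huber-style cutoff. Small
// normalised residuals get full weight; large ones are down-weighted
// proportionally, limiting the influence of outliers on the least squares.
double huber_weight(double residual, double variance, double k = 1.345) {
    double normalised = std::fabs(residual) / std::sqrt(variance);
    if (normalised <= k) return 1.0;  // inlier region: quadratic cost
    return k / normalised;            // outlier region: linear cost
}
```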

Algorithm 4 Calculate Linear System Function

procedure Calculate Linear System(Result buffer)
    for all results in Result buffer do
        Ji = Calculate Elements of Jacobian vector(result)
        residual = result.residual
        weight = result.weight
        A = A + Ji ∗ Jiᵀ ∗ weight
        b = b − Ji ∗ residual ∗ weight
        error = error + residual ∗ residual ∗ weight
    return A, b, error
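The accumulation in algorithm 4 can be sketched in C++ for a 6-DoF pose: each keypoint contributes a weighted rank-1 update JJᵀ to the 6x6 matrix A and a weighted term to the vector b. The types and layout are illustrative assumptions, not the thesis code:

```cpp
#include <array>
#include <cassert>

// Accumulator for the damped linear system (J^T W J) δ = -J^T W r.
// The λI damping is added outside, as in the text.
struct NormalEquations {
    std::array<std::array<double, 6>, 6> A{};  // zero-initialised
    std::array<double, 6> b{};
    double error = 0.0;

    // One keypoint's contribution: A += w * J * J^T, b -= w * r * J.
    void add(const std::array<double, 6>& J, double residual, double weight) {
        for (int i = 0; i < 6; ++i) {
            for (int j = 0; j < 6; ++j)
                A[i][j] += J[i] * J[j] * weight;
            b[i] -= J[i] * residual * weight;
        }
        error += residual * residual * weight;
    }
};
```

Because each keypoint's contribution is an independent sum term, this accumulation maps naturally onto a hardware pipeline with an adder tree, which is one reason the function was chosen for a dedicated pipeline.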

Mathematical description of the maximum gradient calculation

Finally, we will briefly describe an important aspect of the mapping function, which is the

pixel selection to generate Keypoints. This will be discussed again in terms of computational


requirements in Chapter 5, but is included here to illustrate its effects on the reference data

structure that tracking uses extensively as a local copy of the map stored in the Keyframe. The

pixel selection is as follows, from the maximum gradient in the neighbourhood of a pixel with

coordinates (x, y), where I is the light intensity of that location.

dx = I(x+1, y) − I(x−1, y)

dy = I(x, y+1) − I(x, y−1)

gradient(x, y) = √(dx² + dy²)

max vert(x, y) = max( gradient(x, y), gradient(x, y+1), gradient(x, y−1) )

max gradient(x, y) = max( max vert(x, y), max vert(x−1, y), max vert(x+1, y) )
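The maximum-gradient equations above translate directly into code. This sketch assumes a flat row-major image and evaluates interior pixels only (the border handling in the actual implementation is not specified here):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Maximum central-difference gradient magnitude over the 3x3 neighbourhood
// of pixel (x, y) in a w-wide intensity image, exactly as in the equations
// above: max vert takes the vertical triple, the outer loop the columns.
double max_gradient(const std::vector<double>& I, int w, int x, int y) {
    auto grad = [&](int px, int py) {
        double dx = I[py * w + px + 1] - I[py * w + px - 1];
        double dy = I[(py + 1) * w + px] - I[(py - 1) * w + px];
        return std::sqrt(dx * dx + dy * dy);
    };
    double best = 0.0;
    for (int oy = -1; oy <= 1; ++oy)
        for (int ox = -1; ox <= 1; ++ox)
            best = std::max(best, grad(x + ox, y + oy));
    return best;
}
```

Taking the neighbourhood maximum makes the selection tolerant to pixels that sit one step away from a strong edge, so slightly more candidates survive the threshold test of algorithm 5.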

Algorithm 5 Select pixels to map and use for tracking

for all Keypoints k(x, y) in Keyframe do
    if max gradient(x, y) > gradient threshold then
        result = attempt mapping(k)
        if result == successful then
            k.valid = 1

3.2.3 Mapping and Global Optimisation

The pose estimation process of tracking, with the exception of an initialization phase that

happens only at the beginning, is done with respect to an existing Keyframe, a data structure

encapsulating depth and variance estimates and the frame with which they are made, as was

defined in Chapter 2. This Keyframe is used as input, but at the same time has to be continuously updated. This results in constant synchronisation in the software implementation, which comes with a cost in memory traffic and latency for both tasks.

In addition to this, when a sufficient distance/disparity exists between the Keyframe and the

current camera frame, the Keyframe is replaced. At that point, the depth values need to

be propagated to the new frame. This is similar to the reprojection step that happens in


tracking, with the addition that a depth estimate is written at the location of projection in the new Keyframe and the variance estimate is slightly increased. An extra normalisation step is added to this process to ensure the average inverse depth value is 1.0, to avoid problems to do

with scale ambiguity, an inherent characteristic of monocular SLAM algorithms. During that

process, the tracking function needs to maintain the last valid copy of the map to function,

or alternatively wait out this costly process. In general, the accelerator for tracking has to be

able to co-operate and continuously synchronise its inputs and outputs in millisecond latencies

with multi-threaded functions operating on various buffers that are not guaranteed to be in the

same memory location, and occupy non-contiguous virtual memory pages. This can be one of

the key sources of delays in an accelerator’s real-world performance as will be discussed in the

evaluation section.
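The scale-normalisation step described above can be sketched as follows. The form is an assumption (rescaling to unit mean inverse depth, with variances scaling quadratically); the thesis implementation's treatment of invalid points and variances may differ:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// After propagating depths into a new Keyframe, rescale all inverse depths
// so that their mean is 1.0. This removes the free scale factor inherent to
// monocular SLAM. Variances, being second moments, divide by the square of
// the same factor.
void normalise_inverse_depth(std::vector<double>& inv_depth,
                             std::vector<double>& variance) {
    double sum = 0.0;
    for (double d : inv_depth) sum += d;
    double mean = sum / inv_depth.size();   // assumed non-zero for valid maps
    for (double& d : inv_depth) d /= mean;
    for (double& v : variance) v /= mean * mean;
}
```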

3.3 Profiling and Performance Analysis

The source code used for the analysis in this chapter is the open-source version provided by the

authors of LSD-SLAM for research purposes. It consists of highly optimized C++ code, with

some tasks in hand-written assembly utilizing vector instructions targeting modern multicore

CPUs. The task of mapping iterates over all pixels in the Keyframe serially and the calculations

for each pixel are independent of the others. Thus, it can be parallelised at the loop-iteration level, and in the software implementation it is mapped to multiple threads processing different sets of image rows in parallel. On the other hand, most of the functions involved in tracking need the result of one loop iteration to proceed to the next, hence they need to execute serially. These took advantage of hand-written vectorised routines to attempt to maximise the attainable performance. In total there were three versions of the computationally demanding functions in tracking: one in standard C++, one utilizing SSE x86 instructions and one utilizing NEON instructions targeting ARM CPUs.

As an initial step in the profiling of the implementation, the provided source code was compiled,

run and profiled on a desktop system. Eventually all versions were compiled with the maximum


optimisation enabled on the compiler, as well as platform-specific optimisations, such as using

the hard FPU unit on the ARM core tested in the next sections. The algorithm was timed and

profiled on a desktop PC using the vectorised implementation targeting an Intel i7-4770 CPU

and both the NEON and standard C++ implementations were tested and timed on a dual-core

ARM Cortex-A9. This was done to present highly-optimised software results in both high-end

desktop and mobile platforms, and to have a fair comparison point for the accelerators.

3.3.1 Profiling Tools

The main tool used to obtain profiling results was Callgrind from the Valgrind suite [74].

This tool runs the executable in a controlled virtual environment where it counts the number

of CPU cycles spent on different instructions and matches them to lines of code using debug

symbols. It then aggregates these costs per function call, and can offer annotated code views

with assembly matched to C++ code. It also works well with multi-threaded applications, which was crucial for this implementation, and it provides a good analysis of the computationally intensive parts of an application. For this type of application it gives us a comprehensive view

of where performance is needed, and where the most computationally intensive parts are.

In the profiling results, tracking in LSD-SLAM was shown to account for the majority of the

total computation time together with mapping. Timing the different functions involved, it was

verified that most of the execution time was spread in the different components of the two

main functions encapsulating the tracking and depth map update tasks. Data collected from

the Valgrind / Callgrind tool is summarised in Table 3.1. Because of the multi-threaded nature

of the workload, and because the test cases run here included reading and decoding an image

from disk due to using pre-loaded datasets, the numbers do not perfectly reflect the importance

of each task to real-time SLAM but need to be interpreted in that light.

Firstly, the image input tasks would not appear in online SLAM and would be replaced with

loading an image from memory, copied there from a camera which would have a much smaller

cost. Furthermore, that cost can be overlapped with the tracking computation after the first

image streams in. As soon as the first image loads, tracking can begin operating on that while


the second frame loads in a different buffer. Thus, as long as both tasks are faster than the

target frame rate, and since they are completely independent, this will not affect the throughput

of the system. There are also background workloads included in Table 3.1. Some are part of

SLAM such as the pose-graph optimisation and some are not, such as the visualisation task.

These are not immediately dependent on the online result of tracking and mapping, and unless

we have a tracking failure, tracking does not need to communicate with them. As such, the

performance of these background tasks does not have to affect the performance of the live

odometry happening in tracking and mapping.
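The capture/compute overlap described earlier in this section can be illustrated with a minimal ping-pong (double) buffer, showing only the index arithmetic. A real system would synchronise the swap with a condition variable or a hardware handshake; this sketch is illustrative and not the thesis code:

```cpp
#include <cassert>
#include <vector>

// Two frame buffers: the camera writes frame N+1 into one while tracking
// consumes frame N from the other, so capture latency overlaps with
// computation and, as argued in the text, does not reduce throughput as
// long as both sides meet the target frame rate.
struct DoubleBuffer {
    std::vector<int> buf[2];
    int write_idx = 0;

    std::vector<int>& capture_target()  { return buf[write_idx]; }
    std::vector<int>& tracking_input()  { return buf[1 - write_idx]; }
    void swap() { write_idx = 1 - write_idx; }  // called once per frame period
};
```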

This has to be interpreted by taking into consideration the hardware utilized. Although in domain-specific cases such as control and automotive, and in ultra-low-power applications targeting milliwatt-grade power consumption, one might still encounter single-core microcontroller-grade CPUs, most low-power application-grade CPUs these days have a minimum of two cores, and at the time of writing this thesis it is hard to find a mobile CPU, even in a smartphone, that does not come with four physical cores as standard. Since this is the type of CPU that will be targeted for even a sparse SLAM algorithm to achieve real-time performance, as a microcontroller could not cope, this means that with a split of four tasks, two for SLAM, one for handling communication with the outside world and one for background optimisation, the tracking and mapping tasks will have a physical CPU to themselves for the majority of the time.

As a result of the above, when discussing the acceleration and the frame rate of SLAM, one has to focus on two metrics. The first is the latency of executing each task in software or in a dedicated hardware accelerator. The second, which becomes increasingly important in a data-intensive application such as this, is how well the two live and interdependent tasks, tracking and mapping, can overlap and share data, together with their respective throughputs. This is because in software, and

as we will discuss in the next chapters in hardware as well, they will take a comparable amount

of time to execute. As a result, their running serially instead of overlapped would effectively

mean a significant reduction in performance, since the processing of one frame through both

would have the latency of the two tasks combined instead of the latency of the slower task.
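To make the second point concrete, a back-of-the-envelope sketch in Python using the desktop timings reported later in Table 3.2; the helper name is illustrative, not part of the thesis code:

```python
# Frame-time arithmetic for serial vs. overlapped execution of the two
# live SLAM tasks, using the desktop-CPU means from Table 3.2.

def frame_rate_hz(frame_time_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / frame_time_ms

tracking_ms = 12.3   # Tracking task (hand-coded SSE)
mapping_ms = 14.2    # Map update task (4 threads)

# Serial: every frame pays the latency of both tasks back to back.
serial_ms = tracking_ms + mapping_ms           # ~26.5 ms
# Overlapped: steady-state throughput is set by the slower task only.
overlapped_ms = max(tracking_ms, mapping_ms)   # 14.2 ms

print(round(frame_rate_hz(serial_ms), 1))      # ~37.7 fps
print(round(frame_rate_hz(overlapped_ms), 1))  # ~70.4 fps
```

With the two tasks overlapped, the steady-state frame time is set by the slower task alone, which is the behaviour the rest of this chapter targets.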

If we focus on the computation time that is necessary for real-time SLAM and ignore costs of a


74 Chapter 3. Accelerating semi-dense SLAM

Task or task group                      Percentage of computation
Frame Tracking                          31%
Depth Estimation                        33%
Image Decode and Input                  10%
Frame Creation                          2%
Graph Optimisation and other tasks      9%
Visualisation and Graphics              10%

Table 3.1: Profiling Results - Callgrind / x86 Intel CPU

graphical user interface (GUI), visualisations of data and image file decoding, we end up with a split that is heavily focused on the two main tasks, tracking and mapping, each composed of a few interdependent functions. The data can then be normalised to express percentages relative to the computation time of the tasks necessary to perform SLAM. The resulting normalised

percentages are approximately 44% for the mapping task and 41% for the tracking task, with

the rest attributed to background tasks. This is visualised in Fig. 3.3.

[Pie chart: Mapping 44.4%, Tracking 41.3%, Background Tasks 14.3%]

Figure 3.3: Percentage of total computation for the tasks comprising SLAM


3.3.2 Timing Results

Guided by the profiling results, the execution of key functions was timed on both the desktop

and the embedded CPU. This was done to complement the profiling results by providing a

runtime performance figure and to provide a point of comparison between an embedded platform

and a high-end desktop machine for this type of application. In Tables 3.2 and 3.3 we can see a

summary of average execution time for the x86 and the ARM processor respectively. Execution

times of the different sub-functions were measured and combined under their two respective tasks. The main takeaway from timing all the internal sub-functions is that most

of the computation load was split between different loops operating on arrays of elements and

that they all needed to be addressed to accelerate the tracking and mapping tasks effectively.

One of the key characteristics of the application is the large amount of data movement for these

functions. Something that came up in the timing tests is that the ARM CPU's performance was more significantly impacted by the large number of memory transactions than the Intel desktop CPU's. One strong indicator was the relative drop in performance for the randomly accessed frame at the centre of the residual calculation function. This is attributed to differences in

cache size and memory subsystem design complexity that mean the desktop CPU can hide

more of the memory system's latency. Lastly, the ratio of time spent per function varied due to the much wider multiply pipeline of the desktop CPU, which is better able to cope with functions employing a large number of parallel independent multiplications, such as the linear system update function.

These results establish the importance of accelerating the tracking and mapping tasks in order to enhance the performance of this algorithm on the embedded board. They also show that SLAM, as is the case with many computer vision algorithms, is characterised by large and non-trivial data movement, and that memory and caching strategies should be a high priority in any platform designed for such algorithms.

At the time that the work presented in this chapter was submitted for publication in a peer-

reviewed conference, the comparison was made to the ARM Cortex-A9 that was available on


Task                               Mean execution time
Tracking task (hand-coded SSE)     12.3 ms
Map update task (4 threads)        14.2 ms

Table 3.2: Timing Results - Intel i7-4770 @ 3.77 GHz

Task                                 Mean execution time
Tracking task                        440 ms
Tracking task (NEON acceleration)    378 ms
Map update task (1 thread)           1000 ms
Map update task (2 threads)          576 ms

Table 3.3: Timing Results - ARM Cortex-A9 @ 667 MHz

the FPGA board for testing. This demonstrated the gap between mobile CPUs, with power and

weight characteristics to fit in emerging embedded applications, and the performance required

to process semi-dense SLAM in real-time. The Cortex-A9 on the FPGA was running a full

Linux-based operating system, based on Ubuntu 12.10. In Chapter 5 we will also discuss using a more recent mobile CPU with higher absolute performance, core count, and power requirements, the ARM Cortex-A57, running a newer Ubuntu distribution (v. 16.04). However, the same conclusion still stands: in this type of application, the required performance and latency cannot be satisfied simply by advances in general-purpose hardware, especially in the light of Moore's law slowing down. In this context, this work shows that for this application

a custom hardware solution can provide a significant improvement in performance per watt

and an absolute performance level that can tackle this challenge. In the rest of this chapter, a

deeply-pipelined architecture will be discussed as a candidate towards achieving this goal.


3.4 Accelerator architecture

3.4.1 System architecture and control

In this section, an FPGA-SoC is the base device for an architecture developed to accelerate

key functions of LSD-SLAM. As presented in Chapter 2, this type of device offers a power

efficient mobile CPU tightly integrated with programmable logic fabric on the same chip. The

design was developed around the fixed memory system parameters that the off-the-shelf devices

provide. For the family of devices discussed here, there were four available “High-performance Direct Memory Access” (HP-DMA) ports connected to the DDR controller, allowing direct memory requests and high-speed burst transfers to be initiated directly from master controllers

implemented on the FPGA fabric. These can operate at a maximum of 150 MHz, with a bus width of 64 bits, and contain small buffers that allow some combining of operations. The

custom hardware is also connected, through an AXI interconnect, to a general purpose (GP) port of the ARM CPU. Communicating through the GP port, the accelerator

on the FPGA acts as a slave to enable high-level control and parameter set-up from the software

running on the CPU.

The system architecture is shown in Fig. 3.4. Two accelerators were implemented in the work

presented in this chapter to offload the functions involved in the majority of the computation

time for the tracking task. They are presented in the figure as the “Residual and Weight

Calculation Unit” and the “Jacobian Update Unit”. The first is responsible for calculating

the map point re-projections, their photometric residual and the interpolated image gradient

on the projected location and the individual weighting factors for the error function. These

outputs are then processed in the Jacobian Update Unit, which generates the linear system

representing the weighted least squares optimisation.

Both the CPU and the custom hardware have access to the same memory space, which helps

avoid redundant copying and simplifies communication. However, during execution of the

algorithm, for each camera frame the software transfers the data to be processed to a dedicated

area of the off-chip memory, reserved for accelerator-CPU communication. The reason for this


Figure 3.4: System Architecture

copy is that the input data of these functions comes from interfacing functions to the semi-

dense map, stored in two software buffers. These are scattered in multiple virtual pages in

the Linux OS-managed memory, making memory accesses more costly in hardware, since an address translation is needed for every page (4096 bytes) of data accessed; unless that page is locked in memory, that translation has to happen continuously. To alleviate this, the data are copied

in this dedicated area where contiguous memory is allocated. The memory is IO-mapped at the

OS kernel / CPU level making it non-cached and avoiding a costly step of flushing the entire

L1 and L2 caches after writes from the accelerator or the CPU to synchronise the two. The

alternative has been well studied, and would be adding a translation circuit and optionally a

translation buffer on the FPGA synchronised with the OS. This would however complicate the

design significantly and generate its own overheads in access time, without offering significant research value, which led to the first option being preferred for this work.

Following the copy step, the embedded CPU sets various control parameters on the reconfigurable logic, such as the image dimensions, a previous pose estimate and the hardware pointers to the dedicated data stores in the off-chip DRAM, and calls the accelerators as a direct replacement for the software functions. The tracking software thread executing on the CPU then waits for

to the software functions. The tracking software thread executing on the CPU then waits for

a response from the custom hardware while other software threads perform depth mapping


[Timeline diagram: tracking performs control and synchronisation, a frame cache update, and repeated calls to Accelerator 1 (Residual and Weight Calculation) and Accelerator 2 (Jacobian Update Unit) over pyramid levels 4 down to 1 until each frame is tracked, while mapping executes concurrently.]

Figure 3.5: Accelerated tracking and mapping execution in software

and global optimisation in the background. The accelerator may be called multiple times per

pyramid level without any further copy step required, since the allocated memory for the frame

contains valid data and is simply reused. This reduces the impact of the frame copy.

In Fig. 3.5 we can see a simple example of how this process might work. The top line represents

an example timeline of tracking two frames, one after the other, over a pre-computed map, while

in blue we see how mapping would execute simultaneously with some cost to copy information

over and synchronise different software threads. First come the copy and control steps mentioned above. Then the accelerator updates the frame cache if it is the first time operating

on this pyramid level. In the scope of the accelerator, the term cache is used to describe an on-

chip memory storing a copy of the current camera frame to improve random access performance.

The size of this cache is always equal to the entire frame due to the nature of the accesses.

Finally, we see a different number of calls to the two hardware accelerators, until a level is changed. The width and number of the different executions is not to scale but is meant to showcase

the execution flow of calls and synchronisation. The cost of execution and synchronisation

overheads will be further evaluated in Section 3.5.

The second accelerator has as its inputs the values produced by the first accelerator, stored


in a part of the dedicated DRAM region. The decision to operate in this way was made because the respective function for the second accelerator is not called for every error calculation. Instead,

the linear system is calculated once at the beginning of the optimisation process and then is

recalculated for every new estimate that actually reduces the reprojection error. Therefore,

matching that behaviour in hardware meant executing the second accelerator after the first one

finishes. Since the data generated up to that point are too large to store in an on-chip cache, and are accessed sequentially, so that a cache would not have a big impact on performance, they are offloaded to the off-chip memory. Hence, after the first accelerator returns control to the

CPU, the software compares the new error and then optionally calls the second accelerator for

a linear system re-calculation using that output as its input.

A note on number representation

The custom hardware was designed to maintain the accuracy of the software functions it re-

placed, and be a drop-in replacement in the software implementation, interfacing with the rest

of the algorithm executing on the ARM CPU. Hence, most calculations are on single-precision

floating-point, with some intermediate results stored as double-precision. Designing custom

hardware opens the way to using custom-precision numerical formats. In Chapter 4 we discuss how, for example, fixed-point was used to store and process pixel information and gradients. In tracking, pixel values need a representation that stores fractional digits to preserve information, since subsampled images are processed as well.
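The idea can be illustrated with a minimal fixed-point conversion sketch in Python (the 8 fractional bits below are an arbitrary example for illustration, not the word length used in Chapter 4):

```python
def to_fixed(x: float, frac_bits: int = 8) -> int:
    """Quantise a value to a fixed-point integer with the given number
    of fractional bits (e.g. a pixel intensity or gradient)."""
    return round(x * (1 << frac_bits))

def from_fixed(q: int, frac_bits: int = 8) -> float:
    """Recover the real value represented by a fixed-point integer."""
    return q / (1 << frac_bits)

# A subsampled pixel averaged from its neighbours keeps its fractional
# part, which a plain integer representation would discard.
print(from_fixed(to_fixed(127.5)))  # 127.5
print(from_fixed(to_fixed(0.25)))   # 0.25
```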

In general, there are many points in the system where the high dynamic range of a floating-point representation is necessary, especially for operations in the accelerator of Chapter 5. A half-precision floating-point format was also considered. However, in the accelerators presented in this and the following chapter for tracking, half-precision floating point was not explored, as it was not supported by the tools at the time of the system's development. In the accelerator

presented in Chapter 5, a newer version of the tools was used to test a design utilizing half-

precision floats (16 bits) for units that could potentially afford the reduction in accuracy.

However, the first iterations gave a resource cost not significantly lower than the single precision


float, due to the cost of conversions to and from higher-precision formats, and resulted in larger latency for some tasks.

Overall, for the work presented in this thesis the focus is on the exploration of novel architectures for the system, which potentially had a more significant impact and arguably a higher research value at this point towards bringing advanced state-of-the-art SLAM algorithms to low-power embedded platforms. The use of custom precision, a well-researched topic in custom

computing, was left as a possible future exploration to further optimise resource utilization

once the architecture was set.

3.4.2 Residual and Weight Calculation Unit

This unit implements the functionality of two software functions. It is responsible for the

projection of map points to the camera frame and the calculation of the error, photometric

gradients and weights that result from this comparison. The values of these gradients and

the intensity need to be calculated for a point with the image plane coordinates (u, v) that

result from the re-projection, as demonstrated in Fig. 3.8. To estimate a sub-pixel version of these quantities, a linear interpolation method is employed that utilizes a window of 4 pixels around the point, weighing the contribution of each based on its distance from the point. The implementation of this will be discussed further in Section 3.4.2. Its functionality

is demonstrated in Fig. 3.6. Based on the floating point coordinates (u, v) a weighting factor

is calculated for the four pixels surrounding a point, estimating an interpolated value for its

intensity. The same weights are used to interpolate the gradients dx, and dy as will be discussed

further in Section 3.4.2. Based on these values, the error and the weights necessary for the optimisation process are calculated and streamed out. These values will be needed to calculate

the total residual of the estimated pose. The overview of the architecture of this hardware unit

is presented in Fig. 3.7.

The first task is to re-project the reference point from its 3D coordinates (recovered from the

Keyframe’s pose and the estimated inverse depth) to the current camera plane based on the

current pose estimate. This involves a matrix to vector multiplication and a vector addition.


[Diagram: a point projected at (u, v) = (x + 0.3, y + 0.7); its intensity I(u, v) is interpolated from the four neighbouring pixels with weights 49%, 21%, 21% and 9%.]

Figure 3.6: Intensity value interpolation for a projected point using its 4 neighbouring pixels
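The weighting shown in Fig. 3.6 is standard bilinear interpolation; a minimal sketch of that computation in plain Python (not the hardware implementation):

```python
def bilinear_weights(du: float, dv: float):
    """Weights for the four neighbours (top-left, top-right, bottom-left,
    bottom-right) given fractional offsets (du, dv) inside the pixel."""
    return ((1 - du) * (1 - dv),  # top-left
            du * (1 - dv),        # top-right
            (1 - du) * dv,        # bottom-left
            du * dv)              # bottom-right

def interpolate(values, du, dv):
    """Interpolated value from the four neighbouring samples."""
    return sum(w * v for w, v in zip(bilinear_weights(du, dv), values))

# For (u, v) = (x + 0.3, y + 0.7) the weights are 21%, 9%, 49% and 21%,
# matching the percentages shown in Fig. 3.6.
w = bilinear_weights(0.3, 0.7)
print([round(x, 2) for x in w])  # [0.21, 0.09, 0.49, 0.21]
```

The same weights are reused for the intensity and for the two gradients, as described in the text.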

Figure 3.7: Residual and Weight Calculation Unit

The projection results in the new coordinates, x’, y’ and the new depth (z’) from the frame

of reference of the camera pose. They are then divided by the depth z’ and multiplied by

the intrinsic camera matrix, according to the pinhole camera model, with an optional final

correction according to the lens characteristics to give the corresponding coordinates on the

image plane. The result of this is a pair of coordinates corresponding to the actual captured image frame, (u, v), as shown in Fig. 3.8.
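The re-projection steps described above can be sketched as follows (plain Python; the intrinsic parameter values in the example are made up, and the optional lens correction is omitted):

```python
def reproject(point, R, t, fx, fy, cx, cy):
    """Transform a map point into the current camera frame (rotation R,
    translation t from the pose estimate) and project it with the
    pinhole model, returning (u, v) and the new depth z'."""
    # Matrix-to-vector multiplication plus a vector addition.
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i]
               for i in range(3))
    # Divide by the new depth and apply the camera intrinsics.
    u = fx * (x / z) + cx
    v = fy * (y / z) + cy
    return u, v, z

# With an identity pose, a point one metre ahead of the camera projects
# onto the principal point (cx, cy).
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(reproject([0.0, 0.0, 1.0], I3, [0.0, 0.0, 0.0],
                fx=320.0, fy=320.0, cx=160.0, cy=120.0))
```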

These are passed to the interpolated element block which, as described before, returns the interpolated results for the gradients dx and dy and the intensity. After the interpolated element block returns the three results, the residual calculation unit proceeds to complete the

computation. The first step is calculating the actual residual, which is the difference between the

interpolated intensity at the current frame, and the intensity of the keyframe at the projected

coordinates. Then we proceed to accumulate five sums: the sum and the sum of squares of the interpolated intensity value (∑c₁ and ∑c₁²), the sum and the sum of squares of the keyframe


[Diagram: a world point P (x, y, z), expressed with respect to the keyframe, is re-projected through the camera centres to the point (u, v) on the image plane of the current camera frame.]

Figure 3.8: Tracking involves projecting a map point, with a recovered inverse depth 1/z, to the image plane in the current camera frame

pixel intensity (∑c₂ and ∑c₂²), and the sum of a local weight factor:

weight = 5 / |residual| if |residual| > 5, and 1 otherwise.
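Read as a standard robust (Huber-style) weighting, this factor leaves small residuals untouched and down-weights large ones; a minimal Python sketch under that reading, with the threshold of 5:

```python
def local_weight(residual: float, threshold: float = 5.0) -> float:
    """Huber-style factor: 1 for residuals within the threshold and
    threshold/|residual| beyond it, so that outliers contribute less
    to the accumulated sums."""
    r = abs(residual)
    return 1.0 if r <= threshold else threshold / r

print(local_weight(2.0))    # 1.0
print(local_weight(20.0))   # 0.25
```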

Finally the weight calculation takes place. This was originally a separate function in the code

but was included in this unit since it results in a more efficient design, reusing all the temporary

results, before they are written to the off-chip DRAM. At this stage the weights and final error

are calculated, normalising with the estimated depth variance of the map point.

From a high-level perspective, the module is organised in three pipelined hardware blocks.

One block handles pre-fetching batches of inputs, one block writing back batches of outputs

and the main calculation block performs all of the necessary computation for the residual and

weight calculation. Before the unit begins work, if this is the first step of this pyramid level,

the entire frame being tracked is pre-fetched in a local frame buffer. This is done because

pixel gradients, required as part of the algorithm, are accessed in a random pattern, and would

require extensive unordered memory read requests. In the software implementation these values

would be calculated before tracking began and stored in the DRAM as a floating-point buffer.


Figure 3.9: Residual Calculation Pipeline - Pixel Re-projection

Instead, by pre-fetching pixel intensity information and calculating these quantities on the fly,

there is a significant reduction in memory bandwidth and total latency for the computation at

the cost of extra hardware adders.

The first section of the pipeline of the residual and weight calculation is depicted in Fig. 3.9,

and corresponds to the projection of a map point to the image frame coordinates (u, v). For this

set of computations, there are a total of 9 multiplications and 6 additions. These are pipelined

into three hardware float multiply units and two adders. Targeting a processing interval of 6

cycles to match the latency of the random accesses from the frame cache of the system allowed

a more relaxed allocation of resources in this section. The frame cache, also discussed as part

of the system architecture, contains during computation the current camera frame to guarantee

low latency access for the random memory requests during the residual and weights calculation.

The result from these multiplications and additions is the point coordinates from the new frame

reference. These are then fed into one divider, then one multiplier and finally an adder, all

of which are also time shared, to produce the final pair of coordinates (u, v), that forms part

of the input of the Interpolated Element Unit, described in the next subsection. The main

pipeline was not aggressively optimised for throughput, since the main bottleneck at this point was the communication with the off-chip memory.¹

After the Interpolated Element Unit calculates the intensity and gradient values for the pro-

jected point, the rest of the pipeline calculates the different sums described in the beginning of

the section necessary for the weighted least squares optimisation. At this stage there are 4 mul-

tiplier units utilized to perform these calculations. One is dedicated for the squared sums, but

¹In the next chapter, lessons carried from this work enabled us to test and utilize several optimizations to significantly improve the pipeline's throughput figures.


the other three, together with one of the multipliers depicted in Fig. 3.9, are time-shared with

the multiplications in the weight calculation block. Two float division units are also present, likewise shared with the weight calculation block.

In total the weight calculation section of the pipeline performs 18 float multiplications, four

divisions, two additions, two subtractions and one accumulation. Most of its units are time

shared with the residual calculation as mentioned. There is also a float square root unit and an absolute value unit. The result of this function is a weight factor corresponding to the

current residual.

After this computation finishes, seven results are buffered to be written back to the off-chip

memory. The first three are the re-projected x', y', and z' coordinates, intended for use by the second accelerator. The residual for this tracked point and the interpolated gradients

dx and dy are also buffered and written back. The unit buffers all of these results in a set of 6

Block RAMs. Another Block RAM is used to buffer the resulting weight factor for the residual from the weight calculation, which is the seventh value to be written back.

Some of the allocation and sharing of hardware units is a result of the automatic scheduling

in the High-level Synthesis tools that were used to design the presented accelerator. For some

cases, such as here, the synthesis flow can find design points that are more efficient than a

straightforward hand-built pipeline, in a fraction of the development time required to sched-

ule these operations manually using traditional HDL languages. However, this is achieved by

attempting to reuse units across the pipeline for different operations, and can result in con-

nections that are less organised than the output of a human designer and hence sometimes

unintuitive and harder to depict in a clear way. For this reason, from this point in the thesis onwards we will focus more on resource usage and corresponding throughput, and less on the topology of

the hardware except when that topology is significant to the presented architecture.

The first design explored as part of this work issued memory transactions directly from the

main pipeline calculating the residual and weight values when new values were ready to be read.

However, this was found to suffer significantly from the memory system's latency, resulting in low performance. This was then converted to the system presented in Fig. 3.7, using buffers


to convert memory transactions to bursts. The I/O blocks prefetch batches of 50 map points (5-vectors) into a 250-word buffer and then wait for the residual and weight calculation to finish before writing out 7 bursts of 50-word outputs, each corresponding to an internal buffer for that output array, implemented in BRAMs.

Let us first discuss the choice of an image cache and the recalculation of gradients as needed. When a configuration with a hardware pipeline calculating the gradients on the fly, as opposed to issuing transactions to the off-chip memory, was tested on the board, the hardware pipeline's throughput increased by more than a factor of two by using the frame cache and performing the extra computation. Firstly, the size of the data loaded is smaller, since only pixel intensity

information is loaded, avoiding the transfer of an additional two floating-point gradient values

per pixel. Secondly, by performing one large burst read from memory the overall latency cost

is orders of magnitude smaller in comparison to multiple random access requests. This design

also improves power efficiency, despite the extra multiplications, since off-chip memory requests

which are more costly in terms of energy are reduced.
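The reduction in loaded data can be quantified with simple arithmetic; a sketch assuming one 4-byte intensity value and two 4-byte gradient values per pixel, and a 640×480 frame (all assumed figures for illustration):

```python
# Per-pixel off-chip traffic for loading the tracked frame.
bytes_intensity = 4       # pixel intensity only (assumed width)
bytes_gradients = 2 * 4   # two precomputed floating-point gradients

precomputed = bytes_intensity + bytes_gradients  # 12 bytes/pixel
on_the_fly = bytes_intensity                     #  4 bytes/pixel

width, height = 640, 480  # assumed frame size
saved = (precomputed - on_the_fly) * width * height
print(precomputed // on_the_fly)  # 3 (times less data per pixel)
print(saved)                      # 2457600 bytes saved per frame load
```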

This strategy of using an on-chip frame cache for the current camera frame was not replicated

in the case of mapping information, which forms the rest of the input for the tracking task.

The comparative size of the map is much bigger than the camera frame, but the main reason

is that the mapping information is accessed sequentially, making caching less important since

burst transactions to memory can be used to improve the efficiency of the memory system. A

first, straightforward configuration was designed that would generate these sequential requests

to memory while accessing the map information. It turned out that this system suffered a large performance bottleneck from the memory system latency, which was revealed not in simulation but during the physical board tests. Because the hardware generated by the tools would not overlap multiple memory requests, and the latency for replies from the high-performance ports was greater than anticipated, the accelerator operated at a much reduced rate. A

later modification led to the system that was published and is presented here, where the unit’s

inputs and outputs are read and written as buffered bursts of 50 vectors rather than sequentially

for every iteration. This takes advantage of the increased efficiency of burst reads and writes

and reduce the overall cost of the memory access latency.


Finally, this, combined with the on-chip frame cache, improved the performance of the hardware

units by three to four times in terms of accelerator throughput. The combination of these

strategies however proved insufficient in practice to provide the necessary performance for real-

time operation as a significant memory bottleneck remained and the pipeline was underutilized.

The main reason is that the communication and computation still could not be overlapped

without a complete redesign of the pipeline, which took place in the following work presented

in Chapter 4 utilizing all the lessons learned here. The accelerator presented in this Chapter

executes a burst read, saves it at an intermediate buffer and then processes the first 50 data

points (or 250 words). Then, it streams out the results and prefetches another set of 50. The number of points is not always divisible by this batch size, so the remainder is processed by the same pipeline, in a sequential configuration with a lower throughput. The performance effect of this

is small since the problem size is larger than 20,000 points, so the only significant cost of this

was a small increase in resource usage for the extra control logic.
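The batch-plus-remainder schedule can be modelled behaviourally as follows (plain Python, not the HLS code; `process_batch` and `process_single` are hypothetical stand-ins for the burst-buffered and sequential paths):

```python
BATCH = 50  # map points per prefetched burst (250 words of 5-vector input)

def process_in_batches(points, process_batch, process_single):
    """Full batches of 50 points go through the burst-buffered pipeline;
    any remainder falls back to a slower sequential path."""
    results = []
    full = (len(points) // BATCH) * BATCH
    for start in range(0, full, BATCH):
        results.extend(process_batch(points[start:start + BATCH]))
    for p in points[full:]:          # remainder, lower throughput
        results.append(process_single(p))
    return results

# With a problem size above 20,000 points the remainder is negligible,
# e.g. 20,130 points become 402 full bursts plus 30 sequential points.
full_bursts, remainder = divmod(20130, BATCH)
print(full_bursts, remainder)  # 402 30
```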

Interpolated Element Calculation

This function interpolates three quantities between 4 neighbouring pixels: the horizontal gradient, the vertical gradient and the pixel intensity. The aim is to give sub-pixel accuracy to the tracking process. The interpolation uses the floating-point coordinates (u, v) corresponding to the image plane. The integer parts of these are used to calculate the memory offset for the pattern in Fig. 3.10, and the fractional digits are then used to weight the interpolation process between the 4 pixels.

This unit performs four sets of reads from the pixel cache, stored in two buffers implemented

with registers that store different line segments referred to in Fig. 3.11 as line buffers, for a total

of twelve pixel intensity values. These twelve pixels are used for the interpolation process. There

are four centre pixels whose intensity is interpolated as was demonstrated above in Fig. 3.6 and

for which we calculate four sets of the two gradients dx and dy. For each of the four sets of

gradients we need access to 4 immediate neighbours, hence a total of 16 pixels, which overlap as shown in Fig. 3.10. The top-left pixel used for the intensity interpolation is the one with


coordinates (floor(u), floor(v)), shown in dark grey in the figure.

Figure 3.10: Pixel Gradient and Intensity Interpolation

The gradients for a pixel with coordinates (u, v) are calculated as below and interpolated linearly in the same way as the intensity values:

dx = \frac{1}{2}\left(intensity(u+1, v) - intensity(u-1, v)\right)

dy = \frac{1}{2}\left(intensity(u, v+1) - intensity(u, v-1)\right)
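The per-pixel computation described above can be sketched in software as follows. This is our own reference model, not the hardware source: central-difference gradients are taken at the four integer neighbours of (u, v), then dx, dy and the intensity are interpolated bilinearly using the fractional parts of the coordinates. Names (`Interp`, `px`, `interpolate`) are ours:

```cpp
#include <cmath>

struct Interp { float dx, dy, c; };

// Row-major grayscale image access (assumes interior coordinates).
inline float px(const float* img, int w, int u, int v) { return img[v * w + u]; }

Interp interpolate(const float* img, int w, float u, float v) {
    const int iu = static_cast<int>(std::floor(u));
    const int iv = static_cast<int>(std::floor(v));
    const float a = u - iu, b = v - iv;       // fractional interpolation weights
    float dx[4], dy[4], c[4];
    for (int k = 0; k < 4; ++k) {             // the 4 centre pixels (2x2 block)
        const int x = iu + (k & 1), y = iv + (k >> 1);
        dx[k] = 0.5f * (px(img, w, x + 1, y) - px(img, w, x - 1, y));
        dy[k] = 0.5f * (px(img, w, x, y + 1) - px(img, w, x, y - 1));
        c[k]  = px(img, w, x, y);
    }
    const float w00 = (1 - a) * (1 - b), w10 = a * (1 - b),
                w01 = (1 - a) * b,       w11 = a * b;
    auto lerp = [&](const float* q) {
        return w00 * q[0] + w10 * q[1] + w01 * q[2] + w11 * q[3];
    };
    return {lerp(dx), lerp(dy), lerp(c)};
}
```

The gradient computation touches the 16 overlapping pixels of Fig. 3.10, of which only 12 distinct values need to be fetched, as the text explains.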

A high-level view of the hardware unit implementing this functionality is shown in Fig. 3.11. In the figure, the Fetch Pixel Region hardware generates addresses and accesses the on-chip frame cache to retrieve the 4 line segments comprising the 12 values shown in Fig. 3.10. Two values are then read from the buffers each cycle and fed into a series of multipliers, followed by add/sub units, to produce the final interpolation results at a rate of one set per 5 cycles. The thicker lines in the figure indicate the transfer of multiple values per cycle, while thinner lines are wire connections. The hardware units inside the rounded rectangles are time-shared across many different operations, with a state machine switching multiple multiplexers and control signals to manage this. Some of these connections and multiplexers are omitted from the figure for clarity of presentation.

The operations performed are 6 float subtractions, 12 float multiplications, and 11 float additions in total. The overall initiation interval was used to guide resource allocation in this unit as well. The result of these calculations is a set of three elements, dx, dy, and an interpolated


Figure 3.11: Element Interpolation

intensity C, fed back into the rest of the pipeline.

3.4.3 Linear System Generation - Jacobian Update Unit

This second accelerator, originally called the Jacobian Update Unit in the published work, calculates the derivative values for every reference pixel and uses them, as in the system of equations below, to calculate the Jacobian as well as the Hessian approximation H \approx J^T J. It then integrates these into a linear system, the solution of which provides the next update step for the tracking task.

The accelerator reads back the latest version of the buffers containing the results of the residual and weight calculation (running in the first accelerator) once a linear system solution has been accepted as one that reduced the error. These contain the x′, y′, z′, dx, dy, weight


and residual values for each reference pixel that was tracked. These are used to calculate the 6 values of a Jacobian that is reduced to a row vector, since our function has a scalar output: f : R^6 → R. The 6 elements of the Jacobian are calculated as follows:

J(1) = \frac{dx}{z}

J(2) = \frac{dy}{z}

J(3) = -\frac{x}{z^2}\,dx - \frac{y}{z^2}\,dy

J(4) = -\frac{xy}{z^2}\,dx - \frac{z^2+y^2}{z^2}\,dy

J(5) = \frac{z^2+x^2}{z^2}\,dx + \frac{xy}{z^2}\,dy

J(6) = -\frac{y}{z}\,dx + \frac{x}{z}\,dy
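A minimal software sketch (ours, not the hardware source) of the six Jacobian elements for a point (x, y, z) in camera coordinates with interpolated image gradients (dx, dy):

```cpp
#include <array>

// J is a 1x6 row vector since f : R^6 -> R has a scalar output.
std::array<float, 6> jacobian(float x, float y, float z, float dx, float dy) {
    const float z2 = z * z;
    return {
        dx / z,                                         // J(1)
        dy / z,                                         // J(2)
        -(x / z2) * dx - (y / z2) * dy,                 // J(3)
        -(x * y / z2) * dx - ((z2 + y * y) / z2) * dy,  // J(4)
        ((z2 + x * x) / z2) * dx + (x * y / z2) * dy,   // J(5)
        -(y / z) * dx + (x / z) * dy                    // J(6)
    };
}
```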

Then the Hessian is approximated as the product of the Jacobian with itself, with the added weight factor. The linear system is generated by accumulating these results for each reference point i, with:

A = A + J^T(i)\,J(i)\,w_i

b = b + J^T(i)\,residual(i)\,w_i
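The weighted accumulation above can be modelled as follows. This is our own sketch (names ours); A is kept as a dense 6×6 array for clarity, although since A is symmetric the hardware only needs to maintain its upper triangle:

```cpp
#include <array>

// Accumulates the weighted normal equations: for each reference point,
// A += J^T J * w and b += J^T r * w, with J a 1x6 row vector.
struct Normal {
    std::array<std::array<float, 6>, 6> A{};
    std::array<float, 6> b{};
    void accumulate(const std::array<float, 6>& J, float r, float w) {
        for (int i = 0; i < 6; ++i) {
            for (int j = 0; j < 6; ++j)
                A[i][j] += J[i] * J[j] * w;
            b[i] += J[i] * r * w;
        }
    }
};
```

The solution of A·δ = b then gives the next pose update step of the optimisation.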

In total this accelerator contains twelve floating-point multipliers, one divider, three add/sub units, and three dedicated floating-point adders. The initiation interval was set at seven hardware cycles to match the limit of reading a set of seven inputs per iteration from a single port that can only schedule one word read per cycle. This allowed a simpler accumulation for the Jacobian vector, which introduces an inter-cycle dependency. Again, the allocation of units, scheduling and time-sharing was chosen by the tool, guided by pre-processor directives, mainly the ones setting the targeted initiation interval.


3.5 Evaluation

The SLAM system was implemented and tested on an Avnet ZedBoard carrying a Xilinx Zynq-7020 SoC. This includes a dual-core ARM Cortex-A9 processor running at 667 MHz together with a Xilinx FPGA on the same chip, and 512MB of off-chip DDR3 RAM. The programmable logic contains a total of 85K logic cells, 4.9Mbits of BRAM and 220 DSP blocks. For evaluation, the design was synthesized and placed-and-routed with Vivado HLS and Vivado Design Suite (v2015.4), targeting the development board, on which it was eventually run and tested. Running on the ARM Cortex-A9 is an Ubuntu-based Linux OS which supports the software, as well as custom drivers to interface with the hardware accelerators. The bootloader, OS and all relevant file systems were placed and run from an SD card.

The first step for the evaluation was to port the LSD-SLAM algorithm to that system and

resolve library compatibilities, often having to compile libraries for the ARM core separately.

Next, it was run as pure software to get timing information for different tasks and functions. These were compared with the profiling results to confirm our conclusions and further guide the accelerator design. Then, a user-space driver was written to interface with the FPGA accelerators, at which point the software and hardware versions were run in parallel on the same system to check their behaviour. SLAM is inherently a family of algorithms dealing with probabilistic estimates based on imprecise data. As such, in this work we focused on achieving results equivalent to the open-source software implementation, which was shown to work as expected on real-world datasets both in the published work [17] and on our own machines for verification. After verifying that the hardware accelerator's output was the same as the output of the software functions, the full SLAM algorithm was run with the hardware-based version of tracking.

The FPGA was run at a clock frequency of 100 MHz, to meet the timing achieved by the tools. At that frequency the system achieved an average total frame processing time, including the cost of memory copies and some bookkeeping on the software side, of 218ms per frame. This corresponds to an average performance of 4.55 frames per second for our main test scene, tracking at a resolution of 320x240. This resolution is typical for state-of-the-art


direct SLAM, and as mentioned in Engel's work [17], tracking stops one pyramid level below the camera frame's original resolution, while mapping uses the full resolution to obtain depth measurements. For the exact same test sequence the software-only version, without the NEON extensions, running on the dual-core ARM A9 achieved 2.27 frames per second, while the use of the hand-coded assembly routines raised that to 2.6 frames per second. The frame-by-frame tracking performance of both the software and hardware versions can vary with the number of map points present in the scene.

More complex scenes require more operations, and the running time of both the hardware and software versions varies approximately linearly with the number of points. The average performance results are given in Figure 3.13, where this difference becomes evident. In the figure, the lower bars represent a particularly complex sequence with a higher ratio of map points to total pixels. In the centre of the figure, the vectorised version with inline NEON assembly is included as a fair comparison to the best achievable performance on this general-purpose CPU platform. Examples of two scenes that would generate maps with a different number of points are shown in Fig. 3.12. The scene on the right contains fewer mappable high-texture areas and as a result requires less computation time.

Figure 3.12: Different scenes will generate keyframes with fewer or more mapped points, which affects the runtime both in hardware and in software

The FPGA-accelerated system was consistently 2× faster than the pure software version across a variety of scenes. The benchmark sequences used are listed in Table 3.4, and were provided by TUM [75] as example sequences for LSD-SLAM. This is a promising result, and it is only


constrained by memory bandwidth, not by computation performance. This performance was achieved with an estimated dynamic power consumption of 0.71 Watts for the FPGA, and a chip total of approximately 2.25W. The power consumption of the software-only version without the vector instructions was conservatively estimated at 1.54 Watts for a 50% average core load. Even with these figures the FPGA can perform the computationally intensive task of tracking at an estimated 6.40 frames/s/Watt, in comparison to just 1.48 frames/s/Watt for the CPU. The power figures for this chapter were based entirely on estimates given by the Xilinx Vivado tools. In Chapter 5, power monitoring was included for the development board to compare different platforms with greater accuracy.
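The efficiency figures above follow directly from throughput divided by power; a quick arithmetic check (ours, with the CPU figure rounding to 1.47 rather than the quoted 1.48):

```cpp
// frames/s/Watt = throughput / power for each platform.
constexpr double fpga_fps = 4.55, fpga_watts = 0.71;
constexpr double cpu_fps  = 2.27, cpu_watts  = 1.54;
constexpr double fpga_eff = fpga_fps / fpga_watts;  // ~6.40 frames/s/Watt
constexpr double cpu_eff  = cpu_fps / cpu_watts;    // ~1.47 frames/s/Watt
```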

[Figure 3.13 bar chart: ARM A9: 2.27 / 1.9 fps; ARM A9 with NEON: 2.6 / 2.2 fps; Hardware Accelerated: 4.55 / 3.84 fps]

Figure 3.13: Average performance in frames per second for two different sequences selected from the datasets in Table 3.4. The grey colour corresponds to a denser, more complex scene where a higher number of points are mapped and used for tracking. Examples of two scenes that would generate maps with a different number of points are in Fig. 3.12.

For the FPGA used in this chapter, the utilized resources were 54% of the DSPs (119 DSPs) and 44.7% of the flip-flops (47.5K), as shown in Table 3.5. However, the current design is not well utilized: a rough estimate of the available performance, based on the multipliers and adders generated, is approximately an order of magnitude higher than what was achieved. It turned out that this architecture could not cope with the memory traffic and the memory system's latency. Additionally, the design of a deep pipeline to combine all the necessary computation


Dataset          | Duration | Resolution      | Description
Desk Sequence    | 0:55min  | 640x480 @ 50fps | Indoors, small cluttered space
Machine Sequence | 2:20min  | 640x480 @ 50fps | Outdoors, combination of near and far-away objects

Table 3.4: Datasets used, provided on TUM’s website by the authors of LSD-SLAM

Resource | LS. Update (J) (estimated by tool) | Res. & Weight Unit (estimated by tool) | Post-Implementation
DSPs     | 48    | 78    | 119
BRAM     | 0     | 73    | 73
FFs      | 14630 | 32328 | 47570
LUTs     | 24902 | 41383 | 40298

Table 3.5: FPGA resources. The first two columns represent the post-synthesis estimates, where the tool has allocated a certain number of resources to each instantiated hardware unit. Post-implementation, Vivado uses various optimisations to reduce usage by combining circuits or simplifying various units.

was expected to improve utilization by allowing the automatic scheduling algorithms of high-level synthesis to time-share hardware units more effectively. Instead, possibly due to the immaturity of the tools but also due to the nature of high-level synthesis frameworks, the tool could not schedule many operations effectively in this fashion, and instead generated more overhead as the pipeline became longer and more complex.

As we will discuss in the following chapter, designing smaller hardware units connected to each other through buffers, where crucially the tool cannot determine the input and output schedule but treats them as operations with a latency of one cycle, turned out to be more effective, as it allowed us to manually tune particularly critical pipeline paths for better latency and performance characteristics. The most significant redesign, however, which gave a performance breakthrough in Chapter 4, was the conversion of the design philosophy to a dataflow paradigm, which suited this type of application much better. We will discuss this in further detail in that chapter.


[Figure 3.14 bar chart: Level 4: 520 μs; Level 3: 1221.11 μs; Level 2: 4602.69 μs; Level 1: 38939.1 μs]

Figure 3.14: Memory transfer cost in microseconds to synchronise the map and frame with the hardware buffers in the DRAM. This happens once per pyramid level.

As it stands, this design achieves more than 4.5 frames per second, including some time spent on data movement. The measured cost of data copies is shown for each level in Figure 3.14. Interfacing custom hardware with software that uses large dynamically allocated buffers brings a penalty due to the cost of synchronising a large amount of data. These buffers are also stored in virtual pages, scattered across the physical system memory. It turned out that copying the input data to a dedicated area of the RAM takes a significant percentage of the total execution time, up to 25% at this performance level.
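With up to 25% of the runtime spent on serial data copies, Amdahl's law bounds how much accelerating only the computation can help; a minimal sketch (function name ours) of that bound:

```cpp
// Amdahl's law: with a serial fraction s of the runtime and a speedup k on
// the remaining (1 - s), the overall speedup is 1 / (s + (1 - s) / k).
double amdahl_speedup(double serial_fraction, double compute_speedup) {
    return 1.0 / (serial_fraction +
                  (1.0 - serial_fraction) / compute_speedup);
}
```

With a 25% serial copy cost, even an arbitrarily fast accelerator cannot exceed a 4× overall speedup, which is why the data-movement redesign discussed later is essential.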

The per-function performance of the hardware accelerator scales linearly with the number of points, with a small extra penalty at the final level due to some additional reads and writes needed to support a feature of the original software implementation, namely a reported pixel quality metric. This is shown in Figure 3.15. In that figure, Residual and Weights refers to the calculation of the photometric residual and the corresponding weights in the first hardware accelerator, two different functions in software, combined here in a sum for easier comparison. Linear System Update refers to calling the Jacobian Update unit to calculate the Jacobian and


[Figure 3.15 bar chart: per-level timings in μs (Levels 4 to 1) for Residual and Weights (FPGA vs software) and Linear System Update (FPGA vs software)]

Figure 3.15: Software and hardware timing for individual functions. These are run multiple times until the error stops decreasing significantly, at which point the process is repeated for coarse-to-fine pyramid levels until the penultimate level (subsampled once from the original image) is reached

Hessian, and then solving the linear system. It is important to note that the Residual and

Weight Calculation unit is called more often than the Jacobian Update unit, while the memory

copy mentioned in the previous paragraph happens only once per pyramid level.

The pipeline for the residual and weight calculation has an initiation interval of 6 cycles, which translates into a throughput of approximately 16.6 million points per second. However, to achieve that throughput the input data rate would have to be about 300MB/sec and, crucially, the output rate would have to be more than 420MB/sec. After several hardware architectures were designed and tested on the development board, a good design point turned out to be to perform one larger burst transfer for 50 reference points (a total of 250 words, or 1000 bytes), perform the computation on them while caching the results locally to allow the pipeline to run as fast as possible, and then write the results back to the DDR in a series of 7 successive burst writes. The accelerator updating the linear system has a similar bottleneck from memory performance and latency. The technical reference manual indicates that ideally the high-performance DMA port on the SoC can sustain bursts of 255 words, transferring


one word per cycle at a maximum frequency of 150MHz, or 600MB/sec sustained, so if a steady stream of data could be supplied, that would mean approaching the theoretical limits of the port. What this shows, again, is the high communication demands characteristic of SLAM algorithms. The conclusion is that a complete redesign of the way the data is stored and accessed is a necessary hardware/software co-design step to allow any hardware accelerator to function efficiently. Otherwise the cost of data synchronisation will limit performance gains in accordance with Amdahl's law.
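The bandwidth arithmetic in this section can be checked directly. The per-point word counts below (5 words in, 7 words out) are our reading of the burst sizes quoted above (250 words per 50 reference points; 7 burst writes of results), and are an assumption rather than a figure stated explicitly:

```cpp
// Throughput and bandwidth implied by a 100 MHz clock and an initiation
// interval of 6 cycles, with 4-byte words.
constexpr double clock_hz = 100e6;
constexpr int    interval = 6;
constexpr double points_per_s = clock_hz / interval;   // ~16.6 M points/s
constexpr double in_rate  = points_per_s * 5 * 4;      // 5 words in:  ~333 MB/s
constexpr double out_rate = points_per_s * 7 * 4;      // 7 words out: ~466 MB/s
```

Both rates sit uncomfortably close to the 600MB/sec ideal limit of a single HP port, which is the quantitative core of the argument above.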

The latency and acceleration results discussed in this section demonstrate that this hardware design approximately doubled the performance and improved performance-per-watt by more than 4× compared to a low-power embedded CPU. This doubling in performance at that power level means that within a power budget smaller than that of a mobile CPU, the algorithm can follow movement of the robot/platform that is twice as fast with the same tracking accuracy as before, or alternatively provide more accurate tracking estimates, and as a result a more accurate reconstruction, at the same platform velocity compared to the software version. Figures 3.16 and 3.17 demonstrate the impact of running through the same dataset while having to drop frames due to a 2-4× reduction in performance. A minimum level of performance for a given platform is necessary to avoid very large errors and potential tracking failure due to fast movement. However, further improvements past that minimum are important, as they can provide a larger improvement in quality and accuracy for the same algorithm. For example, in Fig. 3.17 a 2× reduction in performance leads to visible deformation of recovered map surfaces, reduced detail on some structures and a significantly larger accumulated trajectory error in the map on the left of the figure.

3.6 Conclusion

Custom hardware on an off-the-shelf FPGA is capable of bringing advanced and denser SLAM algorithms to embedded low-power devices. However, different bottlenecks in the proposed architecture prevented the platform from reaching its potential. A hardware/software


[Figure 3.16 panels: "Skipping frames due to performance constraints" (left) vs "Adequate performance for the movement speed" (right)]

Figure 3.16: In a moving platform or camera, the performance of SLAM will directly affect the accuracy of tracking and the accuracy and quality of the reconstruction, and in extreme cases lead to tracking loss. The map on the right is from a system that delivers a 4× improvement in performance compared to the left. The map on the left has accumulated a very large error from skipping frames due to performance constraints

[Figure 3.17 panels: "Higher performance leads to more information processed" and "Reduced drift and error accumulation for pose and map"]

Figure 3.17: Even if the slower system can keep up, improved performance is crucial to improve quality and accuracy under fast movement. The map on the left was recovered on a system with a performance deficit of 2× compared to the one on the right


co-design step is necessary, since CPU-optimised algorithms and implementations are not well suited to an accelerator. A hardware design should place a lot of importance on memory architecture, including caching techniques and data movement that overlaps memory accesses with computation, to ensure a robustly high level of performance. With these communication requirements in mind, as well as the nature of FPGAs, which run at a lower frequency but can perform a large number of operations simultaneously every cycle, a streaming interface is ideal, where a partial redesign of state-of-the-art algorithms would create a sequential data movement from one point in the pipeline to another. If a redesign of the front end ensures that data passes through the FPGA at a high sustained rate, even a small FPGA-SoC such as the one used in this work can achieve real-time performance for the task of tracking within a small power budget of less than 3 Watts.

The following chapter discusses exactly such an architecture, designed from the ground up with many domain-specific optimizations around a dataflow design. As we will discuss in Chapter 4, it achieved a sub-10ms latency for all the iterations comprising one tracking step.


Chapter 4

Accelerating Tracking for SLAM

The previous chapter presented an investigation of semi-dense SLAM, focused on LSD-SLAM, on modern high-end desktop machines and embedded platforms. The computation patterns, and the memory and performance requirements, were discussed, and profiling and timing results were used to guide the acceleration of key methods. A straightforward implementation of an accelerator on an FPGA-SoC showed the potential performance that could be achieved by these platforms, but revealed bottlenecks and limitations that reduced the peak achievable performance. In this chapter, we present a high-performance design for a custom accelerator targeting the task of tracking in semi-dense SLAM. Taking advantage of the lessons learned in Chapter 3, the design presented here focuses on providing an efficient, high-performance solution for direct tracking using a high-bandwidth streaming architecture, optimised for maximum memory throughput. At its centre is a Tracking Core that accelerates the necessary computation involved in the non-linear least-squares optimization for direct whole-image alignment at a much higher performance per watt compared to general-purpose hardware.


The architecture is designed to scale with the available resources in order to enable its use at different performance/cost levels and platforms. When evaluated, it achieves real-time performance on an embedded low-power platform, without any compromise in the quality of the result.

This chapter is based on a conference paper co-authored with Christos-Savvas Bouganis [Konstantinos Boikos et al., FPL, IEEE 2017 [19]].

4.1 Motivation

SLAM is composed of two strongly interdependent tasks: tracking and mapping. In an unknown environment the reliable output of the first is needed to start generating the second, and vice-versa. The task of tracking is the first component to address, since it has the strictest latency requirements for processing and dropped frames. A low framerate and high latency for tracking will result in an autonomous robot accumulating errors during tracking and mapping, and hence a much slower and less robust operation, with a strong drop in performance potentially resulting in a tracking failure where the pose cannot be recovered at all. In addition, profiling results presented in Chapter 3 showed that, together with mapping, tracking accounts for the majority of the computational requirements of real-time visual SLAM.

As discussed in the previous chapter, embedded-grade CPUs, with power and weight characteristics that fit emerging applications in robotics, IoT and augmented reality, lack the performance required to process semi-dense SLAM in real-time. At the same time, this field is still characterized by significant volatility, with algorithms that change significantly every few years. The use of FPGAs allows efficient heterogeneous architectures to be explored and used in the field, while reducing the cost of the changes needed to follow the state-of-the-art in software. Custom hardware accelerators mapped onto an FPGA-SoC, acting as co-processors to a mobile CPU, can meet both the high performance and the low latency requirements of these applications while at the same time significantly increasing power efficiency for this task, making them a good candidate for overcoming this challenge.


In this context, this chapter presents a custom hardware architecture with a combination of dataflow processing and high-bandwidth memory interfaces that provides a high frame-rate, low-latency solution for direct photometric tracking for SLAM using a monocular camera. In this chapter, mapping is performed in software based on the work by J. Engel et al. [17]. To the best of our knowledge, this is the first high-performance hardware architecture for direct photometric tracking for SLAM on an FPGA-SoC. A high level of sustained performance is targeted, to address the latency requirements of the application, together with low power consumption, to enable embedding this type of application in low-power, lightweight robotic platforms such as quadcopters, fixed-wing drones and household robots, as well as in augmented reality applications.

The key contributions presented in this chapter are the following:

• To the best of our knowledge, the first design of a high-performance, power-efficient accelerator architecture for direct photometric tracking for SLAM on an FPGA-SoC.

• An investigation of the factors that affect the performance of this type of accelerator, the peak performance that can be achieved by an off-the-shelf FPGA-SoC, and the bottlenecks that can arise when it is used as a coprocessor next to a mobile CPU for this application.

4.2 Architecture

4.2.1 System Architecture

The system architecture is designed with an FPGA-SoC in mind: a System-on-Chip that contains a mobile CPU tightly integrated with an FPGA fabric. The CPU and FPGA share the same off-chip DRAM, and they operate on the same memory space. Throughout this chapter, the part of the accelerator that performs the computation on the input data will be referred to as the "Tracking Core". The inputs of this core are, firstly, the current camera frame, accessed in a random fashion during operation; the current recovered depth map, accessed sequentially;


and some input parameters for the algorithm, such as the current pose estimate and the dimensions of the frame being tracked at the current pyramid level. As explained in Section 2.5 of Chapter 2, tracking is performed multiple times at different resolutions, from coarse to fine, to improve convergence, so the current input frame has a subsampled version for each pyramid level. Its output is a linear system, the solution of which provides the next optimisation step, as will be discussed in the next section.
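The coarse-to-fine pyramid described above halves each frame dimension per level; a minimal sketch (ours, with level 0 assumed to be the full camera resolution) of the per-level dimensions the core is configured with:

```cpp
struct Dims { int w, h; };

// Frame dimensions at pyramid level l, halving each dimension per level
// (level 0 = full resolution).
constexpr Dims level_dims(int w0, int h0, int l) {
    return Dims{w0 >> l, h0 >> l};
}
```

For a 640x480 camera frame, level 1 gives 320x240, the resolution at which tracking ran in the previous chapter.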

To support these accesses at the system level, the following interfaces were used. The Tracking Core on the reconfigurable fabric is connected directly to the off-chip memory controller through two dedicated high-performance ports (HP ports), where the accelerator acts as a master, providing a high-throughput and low-latency stream of reads from the DRAM, in a similar fashion to the system in Chapter 3. This type of interface is crucial for the high bandwidth requirements of this type of application. The custom hardware is also connected to a general purpose (GP) port, essentially a peripheral port of the ARM CPU. Communicating through the GP port, we enable high-level control and parameter set-up from the software running on the CPU, where the CPU acts as a master. An important differentiator at the system level from the work presented in the previous chapter is the increased efficiency of communication through the HP ports. In the accelerator presented in this chapter, dedicated input units, decoupled from the processing pipeline, provide high-bandwidth streaming access to the off-chip memory at the full bus width and contain large buffers to improve memory access performance and efficiency.

An important aspect of this architecture is the movement of data inside the core, as well as to and from memory. As this architecture is not meant to be device- or resolution-specific, the memory interfaces are designed to surround the computation core with high-bandwidth streaming access that can alleviate communication bottlenecks and future-proof the system so that the hardware's utilization can be close to 100%. Thus, performance depends mainly upon the capabilities of the computation core and can scale for different FPGAs or a fully-custom design, avoiding the pitfalls and bottlenecks discussed in Chapter 3. As we can see in Fig. 4.1, the communication with the off-chip DRAM is routed through two dedicated hardware ports. Data reads are handled by two input blocks, each with its own FIFO buffers, to allow the input


system to absorb memory inefficiencies and delays. Having dedicated ports means that no CPU time is consumed routing traffic to and from DRAM. Additionally, going directly to the memory controller from a dedicated port means that the custom hardware avoids contention with most of the memory traffic generated by software running on the ARM CPU, and this shorter path has a much lower latency to initiate a transaction.

There is also an output block in the core, with its own master controller, to handle writing the output back to DRAM after the core has finished processing all elements. This unit shares one of the ports with the input units above, since the ports can be operated in duplex mode. The writes happen after all points are processed and are orders of magnitude fewer than the reads; hence, the read bandwidth is not affected by this topology. All the ports are designed to execute transactions using the full bus width in burst transactions, with custom units in the FPGA unpacking data on the fly at the input and output to translate to and from the original word width. One of the biggest bottlenecks in the work presented in Chapter 3 was the latency from initiating to completing a memory transaction. By queueing long burst requests of hundreds of words, buffering the resulting reads and overlapping all I/O operations with computation, the presented architecture results in a core that can leverage very high read bandwidth from the memory system, in order to be able to support even the most memory-demanding algorithms.

Throughout this core, the communication between hardware blocks is done through FIFOs,

as seen in Fig. 4.1 and Fig. 4.2 following the dataflow paradigm. This enables the different

hardware blocks to function independently from each other, beginning processing as soon as a

data point becomes available and blocking if any stream becomes empty. This decouples the

internal tracking pipeline from the input blocks, and by configuring the input blocks at a faster

rate than the rate of consumption further down the pipeline, the gaps that can arise between

memory requests due to memory traffic and the latency of the memory system can be hidden.

This way, the utilization of the hardware resources of the pipeline is significantly improved since

they spend less time waiting for data to arrive from memory. The pipeline's units are designed
to operate at a slower rate, using fewer resources, but with a high utilization (close to 100%).


Figure 4.1: System Architecture

4.2.2 Direct Tracking Core

The tracking core is designed as a set of hardware blocks that operate independently in a

dataflow paradigm, with matching processing rates, connected by sets of FIFO buffers. Instead

of explicit control signals, most units start or pause processing depending on the availability and

type of data. Further taking advantage of this, functionality is split into smaller self-contained

units that are easier to design and optimise using a high-level synthesis language. The hardware

is also designed to scale in terms of resources. By changing the target processing rate of the

pipeline’s components, and with minimal modifications, the proposed architecture can target

a larger FPGA or be scaled down.

Different parts of the pipeline can utilize a larger number of arithmetic units to schedule more

operations per cycle, or re-use fewer units to trade off processing throughput with resource

use. Especially in light of the tools used for design space exploration in this work, such an

architecture improves exploration of more efficient or high-performance design points by taking

advantage of the strengths of high-level synthesis and the strengths of an experienced designer.

High-level synthesis offers benefits such as automatic scheduling and hardware unit placement


which can give a quick solution, optimising for the lowest latency for a given set of operations.

Separating the functionality into smaller, self-contained units naturally leads to simpler and

more efficient designs, while making the problem size for each individual unit manageable

enough for the designer to identify better opportunities and override the automatic placement

with a hand-picked solution where necessary.

The inputs for the core, streamed from the main memory, are in the form of a five-element

vector for each Keypoint, with the reads split in a set of three and two elements for the first

and second input unit respectively. Each Keypoint is a point of the recovered depth map that

has a valid depth estimate as described in Chapter 2. These five elements are the estimated

coordinates relative to the Keyframe of a Keypoint (since only the valid Keypoints of the map

are read by this accelerator), the depth estimate, the depth variance and the intensity of that

point as observed by the camera. Additionally, for each Keypoint, a 4x4 window is needed

from the currently tracked frame. The entire current camera frame is preloaded on the FPGA

once for every pyramid (resolution) level, in a partitioned memory described in Section 4.2.3, to

avoid the off-chip memory latency which for successive random accesses would quickly become a

very significant bottleneck. In this work it was demonstrated that performing some redundant

computation, in order to significantly reduce memory traffic, leads to higher performance but

can also improve power characteristics owing to the larger energy cost of transferring data

off-chip compared to performing processing on the fly.

In Algorithm 6 a high-level pseudocode representation of the tracking task is demonstrated, as it

was implemented in hardware. With the exception of some high-level decisions and the solution

of the final linear system that were allocated to the mobile CPU, most of the functionality of

the pose estimation is implemented in the reconfigurable hardware. The inputs are the current

camera frame at the resolution of the current pyramid level, an initial pose estimate and the

Keypoints from the recovered map, while the output is the current error for this pose estimate

and the Gauss-Newton linear system to solve for the next optimisation step. The accelerator

pre-computes the linear system update and streams it out to the external memory in case it

will be required in the next step of the algorithm.


Algorithm 6 Tracking, hardware accelerator - Calculate next optimisation step

procedure Tracking Accelerator(new Frame, tracking reference, pose estimate, K)
    # Runs for every new camera frame and pyramid level
    if New Camera Frame Captured then
        Update Hardware Frame Cache with Current Camera Frame
    for all Keypoints in reference do            # Keypoints with a valid observation
        Stream-in reference[Keypoint]            # Implemented as a deep streaming pipeline
        Reproject Keypoint to new Frame(reference[Keypoint])   # Use current pose estimate
        if Coordinates are valid then
            Calculate interpolated intensity I and gradients dx, dy
        else
            Signal next pipeline units to skip
        Calculate intensity residuals
        Calculate squared values and sums        # Interleave accumulated values in multiple register banks
        Calculate fitness score
        residual = residual * fitness
        if fitness score < fitness threshold then
            Signal next pipeline units to skip
        # Directly calculate weights and error
        variance = reference[Keypoint].depth variance          # Cached
        Calculate weight(variance, residual)
        Add pixel noise estimate to weight(weight)
        weighted error = Huber(error * weight)                 # Huber norm
        summed error = summed error + weighted error * weighted error
        # Pre-compute the linear system; interleave accumulated values in
        # multiple register banks to improve pipeline performance, re-combine on exit
        Ji = Calculate elements of Jacobian vector             # From registered results
        A = A + Ji * Ji^T * weight
        b = b - Ji * residual * weight
        error = error + residual * residual * weight
    return errors, num successful points, A, b


Image Frame Projection

This hardware block is responsible for projecting a stream of reference points with coordinates

(x,y,z) from the current Keyframe to the plane of the camera frame whose pose is currently

being estimated. The projection is done based on the current estimate of the frame’s pose using

the pinhole camera model and the principles of multi-view geometry, described in more detail in

Chapter 2. The block finally generates a pair of floating-point image frame coordinates (x’,y’)

that are streamed to the Sub-pixel Element Interpolation. The most significant differences

for this block with the design in Chapter 3 are the stream interface, dataflow paradigm and

optimising the calculation patterns to allow deeper pipelining and a more aggressive schedule

for a significantly higher throughput.
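The projection step performed by this block can be sketched as follows. This is an illustrative C++ model, not the thesis HLS source; the pose is assumed to be given as a rotation R and translation t, with fx, fy, cx, cy the pinhole intrinsics:

```cpp
// Pinhole reprojection of one Keypoint under the current pose estimate.
// Returns false for points behind the camera, signalling a skip downstream.
struct Vec3 { float x, y, z; };

bool reproject(const float R[3][3], const Vec3& t,
               float fx, float fy, float cx, float cy,
               const Vec3& p, float& xp, float& yp) {
    // Transform the reference point into the new frame's coordinate system.
    Vec3 q = { R[0][0]*p.x + R[0][1]*p.y + R[0][2]*p.z + t.x,
               R[1][0]*p.x + R[1][1]*p.y + R[1][2]*p.z + t.y,
               R[2][0]*p.x + R[2][1]*p.y + R[2][2]*p.z + t.z };
    if (q.z <= 0.f) return false;      // behind the camera: invalid
    xp = fx * q.x / q.z + cx;          // floating-point image coordinates
    yp = fy * q.y / q.z + cy;          // streamed to the interpolation block
    return true;
}
```

In the hardware this computation is fully pipelined, so one such projection can be issued per pipeline initiation interval.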

The first of these optimisations has to do with accumulating different sums, a significant part

of the tracking algorithm. For every accumulation there is a 5-7 cycle latency for the adder

to produce a final result. To speed-up loops containing accumulations a template hardware

accumulator was designed that would cycle the inputs and outputs to different registers,
depending on the desired throughput, creating in effect partial sums. At the end of the loop the

partial sum registers are added in a final step to generate the full accumulation sum. Since the

cost of that final step is in the tens of cycles and the main loop runs for tens of thousands the

added latency is insignificant but the throughput bottlenecks are lifted. Other optimisations

that are discussed in the next subsections include introducing fixed point for sub-pixel opera-

tions in the place of floating-point and lifting various pipeline bottlenecks to achieve a higher

overlapping of loop iterations and as a result higher throughput.
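The partial-sum accumulator template described above can be sketched as follows (a behavioural C++ model, not the thesis HLS source):

```cpp
#include <cstddef>

// Additions rotate across BANKS independent registers so that a new sample
// can be accepted every cycle despite the 5-7 cycle floating-point adder
// latency; each bank forms an independent accumulation chain.
template <typename T, std::size_t BANKS = 8>
struct PartialAccumulator {
    T bank[BANKS] = {};
    std::size_t next = 0;

    void add(T x) {
        bank[next] += x;             // independent of the other banks
        next = (next + 1) % BANKS;   // rotate to hide the adder latency
    }

    // One-off reduction after the main loop: costs tens of cycles against
    // a loop of tens of thousands of iterations, so the added latency is
    // insignificant while the throughput bottleneck is lifted.
    T finalize() const {
        T total = T(0);
        for (std::size_t i = 0; i < BANKS; ++i) total += bank[i];
        return total;
    }
};
```

The number of banks is chosen to match the adder latency at the target processing rate.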

Sub-pixel Element Interpolation

This block receives the stream of floating-point coordinates (x’,y’), belonging to the image

plane of the current frame, generated from the Image Frame Projection block. Its functionality,

similar to the interpolated element calculation in the previous Chapter, is first to find the block

of 4 pixels surrounding that planar point and calculate their intensity gradients dx and dy.

Figure 4.2: Tracking Core Architecture (streams between blocks: Keypoint position information; (x’, y’, z’); (u, v, control); (I, dx, dy, valid); Keypoint colour and variance; intermediate results)

It then proceeds to calculate an interpolated value for the intensity and the gradients of the point

(x’,y’), using the values of these 4 pixels, based on the decimal digits of the coordinates. The

output is a stream (I, dx, dy) of the interpolated values for the intensity and gradients. However,

as with the previous block, the implementation has changed to allow a lower latency and high

throughput design. This involved transformations of the computation and strategically forced

registers in the tool to allow a better pipelining strategy at the cost of extra resources.

The gradients for a pixel with coordinates (u, v) are calculated as:

dx = (1/2) (I(u + 1, v) − I(u − 1, v))

dy = (1/2) (I(u, v + 1) − I(u, v − 1))

To calculate these gradients, this block needs to fetch a 4x4 window of pixel values for every

iteration, with an address and offset generated on the fly. The memory partitioning and window

access pattern, that will be explained in more detail in Subsection 4.2.3 are designed to allow low

latency access for these 16 values, with the option of scaling to a single-cycle access configuration

at the cost of more memory ports and blocks. Moreover, this block also generates a stream of

boolean values, with each value paired to a tuple of outputs (I, dx, dy), indicating if the image

coordinates are within the frame boundaries and therefore valid. These boolean values are used

further along the pipeline to indicate a skip for a point along the weight and L.S. calculation

path. Here is another key difference owing to the dataflow paradigm, which in this application


significantly improved the architecture’s performance and optimisation opportunities. If the

coordinates are invalid the pipeline keeps running, with a pre-set safe memory address and the

output is safely ignored in the next block.
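The interpolation step can be sketched as follows (a behavioural C++ model, not the HLS source; the helper names are illustrative, and the frame is assumed row-major with 8-bit intensities):

```cpp
// Central-difference gradients over the 4x4 window, then bilinear
// interpolation of intensity and gradients at the fractional coordinates.
struct Sample { float I, dx, dy; bool valid; };

Sample interpolate(const unsigned char* img, int width, int height,
                   float xf, float yf) {
    int u = (int)xf, v = (int)yf;        // integer part of the coordinates
    float ax = xf - u, ay = yf - v;      // fractional (decimal) part
    // The 4x4 window spans u-1..u+2 and v-1..v+2; out-of-frame points are
    // flagged invalid and skipped further down the pipeline.
    if (u < 1 || v < 1 || u + 2 >= width || v + 2 >= height)
        return {0.f, 0.f, 0.f, false};
    auto I  = [&](int x, int y) { return (float)img[y * width + x]; };
    auto gx = [&](int x, int y) { return 0.5f * (I(x + 1, y) - I(x - 1, y)); };
    auto gy = [&](int x, int y) { return 0.5f * (I(x, y + 1) - I(x, y - 1)); };
    // Bilinear blend of a 2x2 neighbourhood for any per-pixel quantity f.
    auto blend = [&](auto&& f) {
        return (1 - ax) * ((1 - ay) * f(u, v)     + ay * f(u, v + 1))
             +      ax  * ((1 - ay) * f(u + 1, v) + ay * f(u + 1, v + 1));
    };
    return {blend(I), blend(gx), blend(gy), true};
}
```

In the hardware, the 16 window values arrive in parallel from the partitioned frame cache of Section 4.2.3, and the arithmetic after the window fetch is carried out in fixed point as described below.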

Finally, this unit utilizes fixed-point computation to store the subsampled image pixels and
for the intermediate interpolation steps after that. At the end of the calculations a converter

stores the final results in floating-point registers for the next unit that uses them. This
significantly reduces the cache sizes, as well as the register use of this unit. However, the larger latency

of the fixed-point arithmetic units compared to the floating-point units using the tools available

necessitated some extra resources to maintain the same throughput. As mentioned in Chapter

3, in this work the focus was mainly on generating a novel high-performance architecture and

the custom-precision opportunities that required further investigation were left as a possible

future exploration to enhance resource efficiency once the architecture was finalised.

Error Calculation

This block receives the output streams from the interpolation process and additional infor-

mation forwarded from the Image Frame Projection unit. The block is responsible for the

calculation of different error metrics, the weights and residuals necessary for the generation of the

linear system to be solved and the estimation of the total photometric error of the current pose

estimate. The input streams include the (x,y,z) coordinates of the reference points, re-projected

from the frame of reference of the new frame, the interpolated intensity and gradients streamed

from the Interpolation block and finally a stream of the intensity and depth variance of the

Keyframe reference points.

The first set of outputs of this block are the residual and a weight factor for each Keypoint, that

will be used along with the (x,y,z) coordinates and the gradients in the next block. It is also

responsible for calculating the error and squared error accumulated sums from the reprojection

of each Keypoint, utilized later in the algorithm, as well as sums for the pixel intensities, and

point quality characteristics for each Keypoint that is being tracked. All of this data is utilized

at the linear system generation, with parts of it also exported at the output block (not visible


in Fig. 4.2) as quality metrics for the latest pose optimisation step.

Another key characteristic of these hardware units, including the Linear System Generation

unit of the next subsection, is that they deal with a large number of accumulations. To achieve

a sustained high throughput and low latency for these, the presented architecture stores results

as partial sums in a number of distributed registers to account for the latency of the
floating-point adder pipeline, as mentioned in the Image Frame Projection subsection. After all Keypoints

have finished streaming through, a final accumulation step combines the registers to obtain the

final result. This way more multiply/accumulate units can be placed and function at a higher

utilization for an improved throughput.

Linear System Generation

Finally, this block receives and combines the Keypoint with the results of the previous blocks in

order to generate the Linear System later solved in software to provide the next pose estimate.

The processed information includes the (x, y, z) Keypoint coordinates, intensity and gradients,

as well as the error and weights that have been calculated. The solution step, as well as the

comparison of the error metrics and the update, are comprised of orders of magnitude fewer

operations compared to the error and weight calculation and the linear system generation and

they only happen at the end of the processing. The comparisons and branches comprise tens

to hundreds of cycles on the mobile CPU which are in the order of a few microseconds, while

the solution of the linear system for a 6x6 matrix using LDLT decomposition in the Eigen

library takes less than 0.25 milliseconds. As such, the resource cost of adding these operations

in hardware is not justifiable, and instead these operations and some high-level control is left

to the software running on the CPU, accounting for less than 2% of the runtime of the tracking

algorithm.

The functionality of this block is to generate the 6×6 matrix A = JᵀJw and the vector

b = Jrw, essentially the linear system to be solved as part of the optimization method described

in the Background. As matrix A is symmetric, only 21 out of 36 elements need to be computed.

Taking this into account, the intermediate results were stored in a partitioned memory of size


21. Depending on the target processing rate, copies of each element register are created and the

accumulations are interleaved so partial sums are stored in each of them. This memory shares

the same distributed register architecture such as the one mentioned in the Error Calculation

block, to allow single-cycle throughput for the accumulation operations. At the end of the

computation, these are added and repacked into matrix form to be written back to DRAM.

This unit is characterised by a large number of multiply and add operations per map element.

As a result, a large percentage of the total chip resources was allocated to this unit to satisfy

a requirement for a larger number of computational units to achieve a high point throughput.
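The per-Keypoint update of the normal equations, with only the 21 upper-triangular elements of A stored, can be sketched as follows (an illustrative C++ model of the accumulation, not the HLS source):

```cpp
// Accumulates one Keypoint's contribution to the Gauss-Newton linear
// system. J is the 6-element Jacobian row for this point, r its residual
// and w its weight; signs follow the update rule of Algorithm 6.
struct NormalEquations {
    float A[21] = {};   // packed upper triangle of A = sum(w * J * J^T)
    float b[6]  = {};   // b = -sum(w * r * J)
    float error = 0.f;  // sum(w * r^2)

    void accumulate(const float J[6], float r, float w) {
        int k = 0;
        for (int i = 0; i < 6; ++i) {
            for (int j = i; j < 6; ++j)
                A[k++] += w * J[i] * J[j];   // 21 multiply-accumulates
            b[i] -= w * r * J[i];
        }
        error += w * r * r;
    }
};
```

In the hardware each packed element is further replicated into interleaved register banks, so the 21 accumulations sustain single-cycle throughput before the final re-combination.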

4.2.3 Frame Cache Partitioning

The Sub-pixel Interpolation block requires access to a 4x4 window for each tracked valid map

point. An architecture is presented here for the frame cache to achieve low latency access to

windows of this size (4x4). This is then generalized for a window of any NxM size with M being

a multiple of 4 and N any integer. This mapping is optimised for an FPGA due to the type

and structure of the memory blocks available on this type of device. In most modern FPGAs,

memory is mapped to blocks of SRAM of thousands of bits with each block offering independent

read/write ports. In the devices we target here, 18/36Kbit Block RAMs are available, with

each block offering two independent ports that can operate synchronously. Thus, by organising

these blocks in different configurations one can access multiple pairs of elements simultaneously

provided they are guaranteed to exist in different Block RAMs.

Using the assumption that the image width is a multiple of 4, an assumption that holds in most

widely used resolutions in the literature, the image elements (pixels) are stored in memories

partitioned with a factor of 4 in a cyclic fashion. Thus, the memories are organised as 4 sets of

Block RAMs with each successive pixel cycling between the available sets. In this configuration,

pixels 0,4,8 ... will belong to the first memory set, pixels 1,5,9 ... to the second and so on. Using

the 2-port BRAMs available in this family of FPGA devices, eight successive elements can be

accessed in a single cycle from four sets of BRAM organised in this way. Then, successive

image rows can be stored again in different sets of memories, repeating this cyclic pattern in


the vertical dimension in order to have the ability to perform parallel accesses for successive

rows. Combining the two, neighbouring pixels in a window can be accessed simultaneously by

using different memory sets in parallel.

This partitioning is demonstrated in Fig. 4.3. There are two levels of partitioning, the first one

dealing with storing successive image rows and the second with storing successive pixels in the

same row. A memory partitioned in this fashion for N row-sets and M column-sets results in

M ×N sets of memory with 2×M ×N independent ports for read and write access.

Figure 4.3: Frame Cache Architecture (an address generation unit feeding N row-sets of memories, each split into M column-set memories #x.1 ... #x.M)

For the case of a 4x4 window, with N=2 and M=4 the address generation is performed as

follows. For each row, a base address is derived from the address of the first pixel on the left,

defined as the pixel index ∈ [0, (width × height)/2 − 1], by setting the two least significant

bits (LSBs) to 0. This address, and the address base + 1 which is calculated by simply setting
the least significant bit to 1, are sent to the memory banks in the current row as in Fig. 4.4

while the two LSBs, defined as the offset, are used for the symmetric 4-to-1 multiplexing to

select the required values. Since we are operating in this case with a cyclic partitioning over

four memories, as we can see in the figure, this simplifies multiplexing and address generation

significantly compared to a non-multiple of 4 pixels per row. This partitioning also offers

improved write performance in light of the off-chip read bandwidth. The ports to the off-chip

DRAM read sets of 8 successive 8-bit pixel values as a 64-bit word per cycle, so the ability to

store 8 pixel values per cycle offers the best utilization of that bandwidth to lower the latency

of a frame-cache update.
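The cyclic addressing described above can be modelled as follows (an illustrative C++ sketch with assumed names, not the HLS source): bank k holds the pixels whose column index is congruent to k mod 4, so four consecutive pixels of a row always map to four different banks.

```cpp
struct BankAccess { unsigned base, offset; };

BankAccess decompose(unsigned pixel_index) {
    // Written as /4 and %4 rather than bit masks so an HLS compiler can
    // infer that the per-bank reads are independent.
    return { pixel_index / 4, pixel_index % 4 };
}

// Fetch four consecutive pixels starting at pixel_index. Seven candidate
// values are read (base and base+1 across the banks), then the offset
// selects the run of four, mirroring the symmetric 4-to-1 multiplexing.
void read_row(const unsigned char* bank[4], unsigned pixel_index,
              unsigned char out[4]) {
    BankAccess a = decompose(pixel_index);
    unsigned char raw[7];
    for (unsigned j = 0; j < 7; ++j)
        raw[j] = bank[j % 4][a.base + j / 4];   // all banks read in parallel
    for (unsigned k = 0; k < 4; ++k)
        out[k] = raw[a.offset + k];
}
```

In the FPGA the two loops collapse into parallel BRAM port reads and a multiplexer stage respectively, giving the low-latency window access the interpolation block requires.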

Lastly, this assumption of a multiple-of-4 pixels per row allows calculating the address of successive rows of the window by replicating this hardware and memory partitioning, and generating

the address for following rows by adding multiples of the row width as necessary. While any

window height could be supported using a modulo operation to cycle between multiple memo-

ries depending on the row index, a multiple of two is recommended as this again simplifies the

multiplexing to a simple case of using the least significant bits, since the modulo operation

in hardware can be quite expensive in terms of resources. In the 4x4 case for example, two

memory sets were implemented, each half the size of the camera frame, to store the odd and

even rows respectively. Therefore, depending on the row index of the top-left corner of the

window, 4 base addresses are generated and sent to the sets of memories, to produce 4 sets of 4

pixels with a latency of two cycles to match the performance of the rest of the Direct Tracking

Core pipeline. This latency can be reduced or incremented as necessary trading off memory

ports and the resource cost for extra multiplexers.

Figure 4.4: Frame Cache Architecture (base/offset address generation: the two LSBs of the pixel index form the offset driving the symmetric 4-to-1 selection, while base and base + 1 address the four memory banks of the row)

Moreover, this addressing scheme has the added benefit that, when expressed in C for high-level synthesis, it allows

the tools to infer this independence and generate the correct hardware and scheduling. In this

author’s experience, the tools do not guarantee correct operation in such cases and need a

specific manner of expression to produce the expected result. For example, instead of using

base directly with a bit mask, one needs to express base as integer division (index/4) which will

still result in a bitwise operation after optimisation but will first allow the underlying compiler

to infer the independence between the read operations and therefore the correct schedule.
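As a concrete illustration (names assumed, not from the thesis source), the two equivalent ways of deriving the per-bank word address and the bank selector can be written as:

```cpp
// Both forms compute the same values, but the division/modulo form lets
// the HLS compiler prove the subsequent bank reads are independent,
// while still reducing to bitwise operations after optimisation.
unsigned base_masked(unsigned index)  { return index >> 2; }  // upper bits
unsigned base_divided(unsigned index) { return index / 4; }   // HLS-friendly

unsigned offset_masked(unsigned index) { return index & 3u; } // two LSBs
unsigned offset_mod(unsigned index)    { return index % 4; }  // HLS-friendly
```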


This takes advantage of the fact that since offset<4, the 4 sequential elements of each row are

guaranteed to be in the range: [(index/4), (index/4) + 6]. Thus, seven values are loaded in

local registers instead of four, starting from the address [index/4] which will always be aligned

on a boundary of four, and therefore always lie in the first memory set. Then by using the

value of offset, the correct four elements are selected and stored in four registers to be used in

the next part of the pipeline. In the case of FPGAs and high-level Synthesis tools, this design

utilizes extra memory ports (this has a lower cost in FPGAs compared to ASICs due to the

mapping of that memory to multiple 2-port BRAM blocks by default) to achieve a low-latency

access for the desired window.

4.3 Evaluation

In this chapter the experiments for evaluating the performance of the architecture were con-

ducted on the same board utilized in Chapter 3, carrying a Zynq-7020 FPGA-SoC [76]. Later,

the scalability of the architecture was investigated by targeting a Xilinx ZC706 Evaluation

board that offered significantly more resources to explore designs across a bigger spectrum of

resource constraints. The board setup for both was similar to the one already described in

detail in the previous chapter, with most of the differences focusing on altering the software to

allow a more efficient operation and some improvements to the drivers handling communication.

LSD-SLAM was compiled and tested as software, and then the core functionality of the tracking

task was replaced with a function that utilizes the accelerator presented in this Chapter. The

communication is achieved through user-space drivers that were designed as part of this work,

using a direct slave interface from FPGA to CPU set up through the Xilinx Vivado toolchain.

The design presented was implemented and synthesized using Vivado HLS and Vivado (v.2015.4).

It should be noted at this point that during the design of the architecture presented in this

Chapter and the following one it was not straightforward to map the intention of our designs to

what the tool allowed. A lot of scheduling and performance issues were overcome with careful

placement of resources and manual isolation of false dependencies, with some even requiring


forcing the tool to disable standard optimisations. The conclusion from this is that while they

allow faster design-space exploration, these tools lack maturity, contain many bugs and are not

yet ready for a complex production-level design.

In the scalability evaluation a newer version of the tools (v. 2016.1) was used to target the

ZC706 Evaluation board. The reprogrammable logic on that device contains 54,650 Logic Slices,

with 218,600 LUTs, 545 BRAMs for a total of 19,620 Kbit, and 900 DSP units [2], allowing for
deeper, more complex designs with higher throughput and larger caches. Due to the resource

constraints of the FPGA fabric on the smaller Zynq-7020 chip, a design point with a throughput of one

Keypoint per 3 cycles was selected for the experiments from the design space we have examined.

On this board, the post-implementation resource usage was 80.8% for the LUTs (43028) and

80% for the DSPs (176). It also required 78.93% of BRAMs (110.5) and 54.84% of FFs (58352).

A frequency of 111MHz was achieved on this board. The estimated dynamic power consumption

of the SoC for this configuration is 2.7 W for the CPU and FPGA combination.

4.3.1 Custom Core Performance

At this design point and frequency the acceleration achieved by the core was 40× in comparison

to a hand-optimized software version running on the dual core ARM-Cortex A9 of the same

board, as shown in Fig. 4.5. This corresponds to a processing time of 9.49 ms in comparison

to 383.9 ms for the software version. When accounting for the cost of copying the data to

be processed to a dedicated area of the DRAM and the setup cost, the total latency for all

iterations increases to 18.34 ms shown in Fig. 4.6, with the acceleration coming to 21×. This

penalty is due to the fact that in the evaluated version, the Keyframe and Keypoints reside

in software/OS managed virtual memory, as the mapping task is still executed as software.

Therefore, for every map update, the new values of the Keypoints have to be copied to the

FPGA managed memory for the next camera frame to be tracked.

In the published article relating to this chapter, the reported latency is the 18.34 ms figure with

the memory copy cost of synchronising the Keypoints included. However, when evaluating

the accelerator as custom hardware designed to work as part of a domain-specific system, the
important figure is the processing time, at less than 10 milliseconds.

Figure 4.5: Processing Time Per Level (per pyramid level, log scale, microseconds; hardware vs. software)

For a hardware track /

map pipeline, where the estimated map generated by the mapping task does not need constant

synchronisation with software threads as the Keyframe and Keypoints can permanently reside

in FPGA managed memory, this data copying ceases to be necessary. As such, this design

working alongside a mapping accelerator on a common memory space could achieve frame
rates of up to 100 frames per second at this resource cost. At this performance level, Fig. 4.7

demonstrates the achieved acceleration over the low-power ARM Cortex-A9 and the accelerator

of Chapter 3.

4.3.2 Resource Scaling

Fig. 4.8, shows post-implementation resource usage for different performance points, ranging

from 1 Keypoint per 4 cycles to 1 Keypoint every 2 cycles. The scaling results are provided with

the same design, with Vivado now targeting the larger Zynq ZC706 board. The most significant

factor in obtaining increased performance is adding floating-point arithmetic units which mostly

affects DSP usage, since the bottleneck currently is not on memory bandwidth. This is due to

the wide and fast-operating input stages.

Figure 4.6: Total Time per Level including Memory Copy (per pyramid level, log scale, microseconds; hardware + memory copy vs. software)

However, if one wanted to continue the scaling trend,
additional or faster-operating memory ports would become necessary as the current bandwidth

becomes exhausted. For example, the input currently provides 2 Keypoints per 3 cycles at the

current configuration of a 64-bit input unit dedicated to the (x, y, z) input vector of 32-bit

values, so a pipeline that could process 1 Keypoint per cycle would be approximately 33%

underutilized.
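The arithmetic behind these rates can be sketched directly, as a back-of-the-envelope check using only the figures quoted above:

```python
from fractions import Fraction

# One 64-bit input port delivers two 32-bit values per cycle.
values_per_cycle = Fraction(64, 32)            # 2 values per cycle

# An (x, y, z) Keypoint vector consumes three 32-bit values.
keypoints_per_cycle = values_per_cycle / 3     # 2/3: "2 Keypoints per 3 cycles"

# A pipeline sized for 1 Keypoint per cycle would only be fed
# at 2/3 of its capacity by this input stage.
underutilization = 1 - keypoints_per_cycle     # 1/3, i.e. approximately 33%

print(keypoints_per_cycle, float(underutilization))
```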

The DSP block usage scales almost linearly with the required throughput, with some additional LUTs and FFs required for the arithmetic units and to provide more pipeline registers. The reason the apparent gradient of LUT utilisation is lower compared to DSP utilisation is that LUT usage is split between computation units and registers, whereas DSPs are used almost exclusively for floating-point math. The design implemented on the Zedboard is the one corresponding to a throughput of one Keypoint per 3 hardware cycles, because of DSP availability.

Although the principles we have discussed remain the same, design points beyond 0.5 Keypoints per cycle were not synthesised as part of this work, for two reasons. Most importantly, as has been mentioned before, the next available design point, 1 Keypoint tracked per cycle, would require a duplication of the majority of resources, and at that point the bottleneck


[Figure: bar chart of frames per second for the ARM+NEON baseline, the accelerator of Chapter 3, and this work.]

Figure 4.7: Performance comparison of this accelerator with our previous work presented in Chapter 3 and with the NEON-accelerated software on the ARM Cortex-A9 as baseline.

would shift to the memory system, which can currently fetch 2 Keypoints every three cycles. That duplication, with a widening of the memory design to enable higher bandwidth, would result in a design requiring the majority of the resources available on the reconfigurable fabric of off-the-shelf FPGA-SoC devices. That would push resource usage to a point where a second accelerator for the mapping task of SLAM would not fit on the same FPGA fabric. Moreover, when this design functions alongside a mapping accelerator as planned, the 0.5 Keypoints per cycle performance point already extends the performance of this accelerator to a level multiple times higher than the one used by the authors of LSD-SLAM in practice [1], and at this point it should be able to cover quite a large array of applications, as discussed in Chapter 2.

4.3.3 Running as part of a SLAM Pipeline

The two ARM cores on the Zynq are running a distribution of Ubuntu Linux as the operating

system. This provides the necessary libraries for the open source version of LSD-SLAM to

compile and execute. That version was subsequently modified so that the tracking function,

with the exception of some control, the solving of the Levenberg-Marquardt linear system and

the copying of some data, is completely taken over by the FPGA core. The DRAM area used


Figure 4.8: Performance/Resource Scaling targeting a larger FPGA (post-implementation resource usage on the ZC706, as a percentage of available LUT, LUTRAM, FF, BRAM and DSP resources, for 0.25, 0.33 and 0.5 Keypoints per cycle)

by the FPGA is marked as an I/O peripheral, making it uncached and unbuffered. This allows simpler sharing of data between the two cores and the FPGA, though it can reduce the efficiency of writes and reads to this memory from the Linux-based OS.

However, the effect is less significant in our case, since mostly streams of sequential stores are performed, which rely more on the throughput of the memory system than on its access latency. A factor more significant than the use of memory that is unbuffered on the mobile CPU's side is that the mapping task still runs in software. That means that for every map update the updated map has to be copied to the dedicated DRAM area of the accelerator, as mentioned in Section 4.2. In fairness to related work, we report the performance taking this copy step into consideration, but this is not indicative of the actual peak performance of the design; the copy can be avoided once the map data is fully migrated into the dedicated DRAM area, with a custom hardware accelerator for that task as well.

Running a full SLAM system, the performance achieved for the entire tracking function was

on average 22.7 frames/s. This was tested on a sequence of camera frames from the “Desk Sequence” provided by the authors of LSD-SLAM [17], stored on flash memory. Before a new

frame can be processed the FPGA needs to wait for the software system to generate the required

input data from the map. This is a source of delay caused by the interfacing with the second half


of the algorithm, as will be discussed in the next section. The software running on the dual-core

ARM Cortex-A9, in comparison, achieved 2.25 frames/s on average, an order of magnitude slower, using both cores to process in parallel at a frequency of 667 MHz. The achieved frame rate, when accounting for the software overhead in the test system, corresponds to a processing time of 44 ms. This brings the acceleration down from 21× to 10× for the full system in comparison to the software version and, finally, approximately 5× in comparison to the accelerator presented in Chapter 3.
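These figures are internally consistent, as a quick check shows (the numbers are the measurements reported above):

```python
# Measured frame rates reported in this section (frames per second).
fps_full_system = 22.7    # this accelerator running the full SLAM pipeline
fps_arm_neon    = 2.25    # dual-core ARM Cortex-A9 software baseline
fps_chapter3    = 4.345   # accelerator presented in Chapter 3

frame_time_ms = 1000.0 / fps_full_system               # about 44 ms per frame
speedup_vs_software = fps_full_system / fps_arm_neon   # about 10x
speedup_vs_chapter3 = fps_full_system / fps_chapter3   # about 5x

print(round(frame_time_ms), round(speedup_vs_software), round(speedup_vs_chapter3))
```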

4.4 Performance Analysis

The two input ports of the core are connected through an AXI4 interface to two of the “high

performance” (HP) ports of the Zynq, where the accelerator acts as a master, and can sustain a high throughput for reads and writes. In addition, an AXI4-Lite interface to one of the “General Purpose” ports, where the core acts as a slave, is used to receive control and parameter signals from the software. Data is moved using the maximum bus width, in this case 64 bits, and unpacked in the hardware into two 32-bit floating-point values.

The HP ports can each scale up to a theoretical 1200 MB/s of read bandwidth at the maximum frequency of 150 MHz, with a continuous stream of 64-bit requests. In a test with a simple producer/consumer hardware pair, a sustained read and write throughput of up to 1000 MB/s each way has been measured successfully for a single HP port. However, there are design parameters that end up reducing the maximum achievable bandwidth, including protocol overheads, the frequency of the master connected to the port, and the overheads of the memory management unit (off-FPGA) and the off-chip memory chip itself.
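The quoted theoretical figure follows directly from the bus width and clock; a sketch of the arithmetic, together with the sustained figure from the test above:

```python
# Theoretical peak for one HP port: one 64-bit beat per clock cycle.
clock_hz   = 150e6        # maximum HP port frequency used here
beat_bytes = 64 // 8      # 64-bit bus width in bytes

peak_mb_s = clock_hz * beat_bytes / 1e6    # 1200.0 MB/s theoretical peak
sustained_mb_s = 1000                      # measured producer/consumer figure
efficiency = sustained_mb_s / peak_mb_s    # about 83% of theoretical peak
```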

Firstly, the ports have to use the full bus width of 64 bits to achieve their full bandwidth, while operating at their maximum frequency of 150 MHz. However, that frequency is dependent on the rest of the Tracking Core, as transactions are issued on the same clock as the rest of the pipeline. As a consequence of the HLS tools used, the entire generated pipeline had to be configured to utilize a single clock input. We need to process a set of 5 floating-point values for


each iteration of the Tracking Core. Transferring them using multiple 64-bit ports, a percentage of the total bandwidth will be wasted, because of the mismatch between the memory transfers per cycle and the corresponding consumption rate of the hardware.

Looking at the case of transferring 5 values per iteration through three hardware ports, one

can bring in 6 values per cycle. That means that the core will be able to process exactly one

5-vector per cycle if computation can keep up, wasting 50% of the bandwidth of the third port.

As a result of this mismatch there can always be a lower than 100% utilization, either on the side of computation or on communication. However, if the tools allowed a design where the input blocks could run on a different clock than the rest of the pipeline, these rates could be more closely matched, allowing for a more efficient interface design at some design points. Finally, a main bottleneck, which becomes much more apparent on smaller boards, is the availability of computational units. A smaller FPGA offers fewer LUTs and DSPs and can only fit design points with a lower throughput. However, by running the inputs at a slightly faster rate than the Tracking Core, pipeline utilization gets closer to 100%, leading to a more efficient use of the available compute performance, as the memory system's latency is hidden.
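The mismatch described in this section can be captured in a small helper; this is an illustrative model of the rate arithmetic, not part of the actual design:

```python
def aggregate_utilization(values_per_iter, num_ports, values_per_port=2):
    """Fraction of the combined port bandwidth consumed when the pipeline
    accepts `values_per_iter` values per cycle from `num_ports` 64-bit
    ports, each supplying two 32-bit values per cycle."""
    supplied = num_ports * values_per_port
    if supplied < values_per_iter:
        raise ValueError("ports cannot sustain one iteration per cycle")
    return values_per_iter / supplied

# Three ports supply 6 values per cycle for a 5-value iteration: 5/6 of the
# aggregate bandwidth is used, i.e. half of the third port is wasted.
print(aggregate_utilization(5, 3))
```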

4.5 Conclusions

The work presented in this Chapter demonstrates a high-performance, state-of-the-art tracking architecture with an embedded-grade power consumption, by utilizing a SoC with reconfigurable logic. This work proved the effectiveness of the dataflow paradigm for tracking in semi-dense SLAM, despite the non-regularity of some computations. It demonstrated the effectiveness of separating communication and computation, to manage the existing memory bandwidth in the most efficient way, and utilized multi-port, specially partitioned caches to deal with the latency of accessing a random window from an image frame.

The proposed architecture is utilized in a full state-of-the-art semi-dense SLAM implementation,

with an estimated 2.7 W of total power. The FPGA performed direct semi-dense tracking,

acting as a co-processor to a dual-core ARM Cortex-A9, with the full system achieving more


than 22 frames/s tracking performance. The architecture is scalable, so it can target FPGAs

with different resource profiles and is designed for significantly faster operation, thus paving

the way to a complete embedded SLAM solution.

In the next Chapter an accelerator is proposed to close the loop of SLAM and fully address the

requirements of the other half of real-time SLAM, the map generation.


Chapter 5

Accelerating Mapping for SLAM

The previous chapter presented a high-performance, low-power design to accelerate the task of tracking in semi-dense SLAM. This enabled, for the first time, the acceleration of a state-of-the-art semi-dense SLAM algorithm to

achieve real-time performance on an embedded low-power platform, without

any compromise in the quality of the result. That work paved the way

to a complete embedded SLAM solution, but since real-time operation is

composed of two strongly interdependent tasks, tracking and mapping, the

second part has to be addressed as well. These two tasks have to happen in

real-time and comprise the majority of the frame-to-frame computational

requirements of the application. Thus, the acceleration of the mapping task was selected as the next research direction, and this chapter presents this author's original work towards the goal of a complete hardware architecture targeting embedded semi-dense SLAM.

This chapter is based on a conference paper co-authored by Christos-Savvas Bouganis [Konstantinos Boikos et al., ‘A Scalable FPGA-based Architecture for Depth Estimation in SLAM’, ARC 2019].



5.1 Motivation

SLAM is composed of two strongly interdependent tasks: tracking and mapping. In an unknown environment, an accurate output from the first is needed to start generating

the second and vice-versa. So far, research and design of custom hardware for tracking has

been presented. This was the first component to address since it is the one most sensitive

to large latencies in processing and dropped frames. This work targets high performance, to

address the latency requirements of the application, and low-power consumption, to enable

the embedding of this type of application in low-power, lightweight robotic platforms such

as quadcopters and household robotics. Having designed FPGA-based accelerators to enable

real-time performance for semi-dense tracking with low-power embedded hardware, the next

challenge towards enabling state-of-the-art semi-dense SLAM on such platforms is accelerating the task of mapping.

In Chapter 3, the two tasks of tracking and mapping were confirmed to account for almost 80% of the computational cost of the algorithm. As well as having high computational requirements, these two tasks share the same latency specifications, at or close to the camera frame rate. This, together with their interdependency, means that to effectively accelerate SLAM both have to be performed with a low latency. Everything else can be offloaded, postponed or computed in the background without significantly impacting the robustness of SLAM, affecting only the large-scale and long-term consistency of the solution. Moreover, most of the inputs of these two tasks are either the output of the other, or shared. This makes the most efficient choice the one where they inhabit the same chip, so that time- and energy-costly memory transactions and other communication are reduced.

In this context, this chapter presents a novel architecture with a combination of dataflow processing and local on-chip caching to match the unique demands of a semi-dense mapping algorithm. It is a custom FPGA-based accelerator architecture for semi-dense mapping, targeting advanced, state-of-the-art semi-dense SLAM algorithms such as [17]. It follows and improves upon design ideas from previous chapters, and introduces novel contributions, combining dynamic iteration pipelines and traditional streaming elements to achieve high performance and


power efficiency for the task of mapping.

The key contributions presented in this chapter are the following:

• The design of a scalable, high-performance and power-efficient specialised accelerator architecture that can process and update a map in less than 20 ms, the average latency between two camera frames.

• A system-on-chip that, combined with the work presented in Chapter 4, forms the first, to the best of this author's knowledge, complete SLAM accelerator on FPGAs, pushing the state of the art in performance and quality for SLAM on low-power embedded devices.

5.2 Mapping Algorithm - LSD-SLAM

The target of the accelerator presented here is the mapping algorithm of LSD-SLAM [17],

following the principles described above in Section 2.6. This configuration has been proven

to provide the state-of-the-art performance discussed in Chapter 2 and maintains the best

compatibility with this author’s existing work. The tracking accelerator presented in Chapter

4 targets LSD-SLAM and as such implementing this mapping algorithm as a base for this

architecture means the two architectures can cooperate with few modifications to previous

work, and the resulting system can be utilized directly as a drop-in replacement for the software

functions completing the same task.

After a successful execution of tracking, the relative 6-Degree-of-Freedom position and orientation between the latest camera frame and the world is estimated. It is then the aim of the mapping algorithm to use that information to triangulate points from two views: the current camera frame, and a previous frame in the camera's trajectory stored with its world-to-camera pose and depth information in the Keyframe data structure.

All visible points with a sufficient gradient successfully matched from Keyframe to camera

frame will have a depth value stored in this data structure. Using this information, the mapping


algorithm adds a new observation for the part of the environment that is observed for the first

time, and performs a filtering update to improve the observations of points also seen in the past.

At the end of this process, successfully observed points in space will have an estimated depth

and depth variance value stored in the Keyframe. The algorithm also keeps track of metadata relating to how well each point has been tracked from different points in the camera's trajectory after being created. This information is used at different parts of the algorithm to add heuristic optimisations which improve the quality of depth estimation and the overall robustness.

As discussed in Chapter 3, the tasks involved in SLAM were profiled running as software on a high-end desktop CPU. The results showed that the mapping task was one of the heaviest tasks in LSD-SLAM, at 38% of computation time across all cores. Further testing on the ARM Cortex-A9 of the FPGA board verified that the profiling results held true, with timing tests measuring the mapping task at an average of 530 ms per map update.

Algorithm 7 is a simplified high-level view of the tasks and functionality of the depth estimation of LSD-SLAM that are implemented in the coprocessor. The pseudocode presented here

essentially summarises the tasks discussed in this Chapter and in Section 2, and introduces the

functionality in the order that it is implemented in hardware and discussed in Section 5.3. The

first step is updating the frame cache and Keyframe cache in the accelerator. These caches are

of the same size as the frame’s resolution, and store an up-to-date copy of the pixel values for

both the current camera frame and the Keyframe.

Then the three functions of generating new matches and observations, regularising by filling

gaps and calculating a smoothed regularised value are essentially executed in parallel as their

parts are implemented in the different hardware units demonstrated in Fig. 5.2 and discussed in

the following section. Their functionality is summarised in pseudocode statements that can be

used as a reference while reading through Section 5.3. Comments in green are used to indicate

which software functions the statements correspond to, but all the statements are executed in

sequence, in a dataflow fashion for all of the Keypoints.

At this point we copy the variable definitions stated first in Chapter 2 for ease of reference.


– Pose: Camera to world pose estimate for the Keyframe’s camera frame.

– Frame: A camera frame that was selected as a Keyframe candidate and its associated

metadata.

– Keypoint Array: Array has the same dimensions as Frame (width x height). Each Keypoint is a potential depth observation, and is composed of the following variables:

  – Valid bit (Is there a valid observation for this pixel)

  – Blacklisted (Has it been blacklisted after multiple failures to match)

  – Validity Score (Used to keep track of successful or failed attempts to match)

  – Inverse Depth (Depth estimate after a map update)

  – Inverse Depth Variance (Estimated variance for the above depth)

  – Smoothed Inverse Depth (Smoothed version of the inverse depth)

  – Smoothed Inverse Depth Variance (Smoothed version of the variance estimate)

– Max gradient array: Same dimensions as Frame. The maximum gradient in a region

around every pixel. A precise definition of its calculation is included in Chapter 3.

5.3 Architecture of the Mapping Accelerator

5.3.1 Coprocessor architecture and FPGA-SoCs

The architecture targets an FPGA-SoC that contains an FPGA fabric and a mobile CPU.

These platforms have been discussed in previous chapters, and the principles here on a system

level are similar to the ones discussed in Chapter 4, with the exception that in this work the

output bandwidth is more important and was designed to specs similar to the input bandwidth.

The CPU and FPGA function independently and can operate on the same memory space and

both have direct access to a common physical DRAM. There are master memory controllers

on the custom hardware for Direct Memory Access (DMA), designed to operate at full-speed


bursts for updating the caches before operation, or to provide a constant stream of map points for the execution of the algorithm. In addition to the high-speed memory connections, there is a direct slave-to-master connection to the CPU, where the CPU acts as a master. In this manner, the CPU has the high-level control of the coprocessor on the FPGA, and can change its operating parameters and coordinate its operation with the software back end. Fig. 5.1, an annotated figure from Chapter 2, demonstrates this system architecture.

Algorithm 7 Update Map and Regularize

procedure Update Map and Regularize(Keyframe, tracked Frame, pose estimate, K)
    # Update and filter functions were fused into one streaming accelerator
    Update Tracked Frame Cache        # Copy latest tracked camera frame on-chip
    if New Keyframe was generated then
        Update Keyframe Cache         # Gets triggered after Create Keyframe executes

    # Function: Observe Depth
    for each Keypoint k do
        Calculate max gradient in neighbourhood
        if max gradient > gradient threshold then
            Calculate epipolar line
            if line and parameters are valid then
                Start exhaustive search
                if Successful Observation then
                    Create observation or update depth value

        # Function: Regularize Fill Gaps
        if k.valid == 0 then          # No valid observation yet at pixel
            # Sliding-window buffering
            Calculate average confidence in window of valid Keypoints
            if neighbourhood confidence > confidence threshold then
                Calculate weighted depth average    # Weighted by ValidityScore
                Initialize new Keypoint k           # Gets weighted-average values here
                k.depth = depth average
                k.valid = 1

        # Function: Calculate Regularized Depth
        # Delayed sliding window, buffering the results of Regularize Fill Gaps
        if k.valid == 1 then
            Calculate smoothed depth and variance value
            k.smoothed inverse depth = smoothed depth
            k.smoothed inverse depth variance = smoothed depth variance

In general, the co-processor architecture is designed to perform semi-dense mapping as part of

LSD-SLAM, with an existing Keyframe with depth and metadata for every observed pixel as

input and a filtered and updated version of that Keyframe with new and updated depth values

as output. In addition to the base operations of stereo matching and scanning along a variable


[Figure: Zynq block diagram. The dual-core ARM Cortex-A9, with L1 I/D caches and a 512 kB L2 cache and controller, shares a memory interconnect and DRAM controller with the reconfigurable logic. The Tracking Accelerator and Mapping Accelerator connect through FIFOs to the 64-bit High Performance AXI ports (HP[3:0]), while a 32-bit master interconnect for slave peripherals links the CPU to the accelerators' General Purpose slave ports.]

Figure 5.1: Zynq 7-series FPGA-SoC and Interconnect. Only connections relevant to the architectures researched in this thesis are included for clarity of presentation.

baseline using epipolar geometry, the heuristic features of LSD-SLAM have been included in the hardware, for example the pixel quality metric and the limiting of scans to a confidence interval on the epipolar line as discussed in Chapter 2, all performed in a streaming dataflow pattern. This

choice was made to keep compatibility with this state-of-the-art method and maintain the same

accuracy and robustness.

Nevertheless, in order to increase the performance that is attainable by the proposed custom

hardware design, the actual hardware implementation is modified with respect to the original

software implementation. For instance, a number of values, such as the maximum gradient in a neighbourhood, were more efficiently calculated on the fly than pre-computed as done in software. Additionally, most of the functions in the algorithm are combined in one streaming pipeline utilizing buffers to overlap computation, as this avoids redundant memory traffic and significantly improves performance and power efficiency. In contrast, software traditionally splits the computation into different functions, looping over the same data a number of times, once for each function.
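The contrast between the two schedules can be illustrated with a toy sketch (illustrative only, unrelated to the actual HLS code): both produce identical results, but the fused version touches each element once instead of once per function.

```python
def multi_pass(data, stages):
    # Software style: one full sweep over memory per function.
    for stage in stages:
        data = [stage(v) for v in data]
    return data

def fused_single_pass(data, stages):
    # Streaming style: every stage applied while each element is on-chip.
    out = []
    for v in data:
        for stage in stages:
            v = stage(v)
        out.append(v)
    return out

stages = [lambda v: v + 1, lambda v: v * 2]
assert multi_pass([1, 2, 3], stages) == fused_single_pass([1, 2, 3], stages) == [4, 6, 8]
```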


5.3.2 High-level Architecture Overview and Functionality

As a first step, the “layout” of the hardware pipeline will be described, with its high-level functionality and some of the system-level and interface design choices that were made. Figure 5.2

contains a high level view of this architecture omitting some connections for clarity, such as

constant propagation and some extra connections to off-chip interfaces. The figure begins with

the I/O controller and lines, and aims to show the dataflow path, as well as the information

transferred. The connections between units are labelled up to the point of one Keypoint per 5

cycles, which is the nominal processing and production/consumption rate of the presented

architecture at its default configuration. In later subsections, the architecture’s scalability will

also be discussed, changing some of these parameters to target different performance points

with different resource requirements.

Before performing the depth update that this accelerator is designed for, the hardware performs an update of the caches. The write connections are not visible in Fig. 5.2, to simplify presentation, but the green and red arrows show the paths used for the majority of the runtime, which are the read buses to two different processing units. The Keyframe pixel cache does not need to be updated on every update step, only on the transition to a new Keyframe, at which point the depth values are propagated to the new Keyframe (overlapping pixels are assigned the depth value from the old Keyframe and an increased uncertainty) and a new camera frame is used for the Keyframe. In contrast, for every Keyframe update a copy from the off-chip

is used for the Keyframe. In contrast, for every Keyframe update a copy from the off-chip

DDR memory to the frame cache is required to fetch the newest tracked camera frame pixel

information. Other data, such as the pose of that tracked frame, are stored in hardware registers

updated through a userspace driver running as software on the mobile CPU. For the update of

these values and the rest of the control, the CPU uses the slave interface to the FPGA.

Following that, as input, the architecture starts reading all the points of the Keyframe sequentially from the off-chip memory. As soon as every updated Keyframe row is computed, it starts

streaming its output back to the same memory, at a different location. The functionality of

the first two units is to ensure a fast and consistent stream of Keyframe points. The “Input

Memory Controller” performs full-speed burst reads from the off-chip memory, that are then


[Figure: block diagram of the dataflow path. Burst reads from off-chip memory (64 bits per cycle) feed the Input Memory Controller and Unpack Unit, followed by the Keypoint and Gradient Check, the Epipolar Line and 5-Point Unit and Generate Scan Points. A Cache Request Handler serves the Keyframe Pixel Cache and the Camera Frame Cache, feeding the Subpixel Intensity Calculation and the Loop Processing Unit (a fast-rate pipeline at one Keypoint per 3 cycles). New Depth Calculation, Subpixel Stereo, Depth Integration, the Fill Gaps Filter and the Depth Regularize Filter follow, ending at the Pack and Output Controller, which burst-writes one Keypoint per 5 cycles back to off-chip memory.]

Figure 5.2: Block diagram of the accelerator architecture

buffered and streamed as Keypoints from the “Unpack Unit” to the rest of the pipeline. The

information making up one point includes a confidence rating and validity indicator, as well as

past predictions for depth and depth variance. Each point comprises precisely 192 bits, fetched as three 64-bit words and later unpacked into the values they represent, as follows:

• unsigned 8-bit integer Valid → Flag indicating if depth estimate is valid for this point

• 8-bit integer Blacklisted → Indicator of repeated depth estimation fails

• 8-bit integer Validity score → Weighted score indicating rate of successful observations

• 8-bit integer Op-code → Used exclusively by the FPGA, exported to be read from the software to diagnose issues.

• 32-bit reserved Word → Not currently used; kept as a place-holder to keep the word size equal to the software implementation, at exactly 192 bits, and to simplify the Input and Output stages.

• 32-bit floating-point Inverse depth → Inverse depth estimate

• 32-bit floating-point Inverse depth variance → Inverse depth variance estimate

• 32-bit floating-point Smoothed inverse depth→ Smoothed Gaussian for visualisation,

search initialisation and other tasks

• 32-bit floating-point Smoothed inverse depth variance → Variance for smoothed

Gaussian
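The 192-bit record can be modelled in a few lines. The field order within the three 64-bit words is an assumption for illustration, but the widths and the 192-bit total match the listing above.

```python
import struct

# Four 8-bit fields, the reserved 32-bit word, then four 32-bit floats:
# 4 + 4 + 16 bytes = 24 bytes = 192 bits (field order assumed).
KEYPOINT_FMT = "<4Bi4f"

def pack_keypoint(valid, blacklisted, validity_score, opcode,
                  inv_depth, inv_depth_var, smooth_depth, smooth_depth_var):
    return struct.pack(KEYPOINT_FMT, valid, blacklisted, validity_score,
                       opcode, 0, inv_depth, inv_depth_var,
                       smooth_depth, smooth_depth_var)

blob = pack_keypoint(1, 0, 3, 0, 0.5, 0.01, 0.5, 0.01)
assert len(blob) * 8 == 192            # exactly three 64-bit words
words = struct.unpack("<3Q", blob)     # as fetched: three 64-bit reads
```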


Internally, a representation of a single bit is propagated along the pipeline for the Valid flag to reduce resource usage, especially, but not exclusively, in the case of on-chip Block RAMs. For all the other variables, the same word size and data type as above were used, and a buffered stream was created for each one. These streams are bundled together as a Keypoint stream, indicated by the main arrows in Fig. 5.2.

As discussed in Section 2.6, the first step once target points start streaming in is to determine which ones are valid and good candidates to attempt to match, i.e. with a local gradient above a certain threshold. This starts by checking their valid flag and a pixel quality score maintained over successive attempts to map, as well as calculating on the fly the maximum gradient of each pixel and its immediate neighbours. The condition is that the maximum magnitude of the intensity gradient in this group of pixels must be above a threshold for the pixel to be considered for matching later on.

pixel later on. As Keypoints start streaming in, the hardware buffers the first four rows of the

Keyframe, and maintains a 5x5 register array which maps to a sliding window panning over

the Keyframe’s intensity values left to right, in successive rows, as shown in Fig. 5.3.

Maintaining the reading pattern of Fig. 5.3 is important to ensure the most efficient access

pattern for heavily re-used memory regions. By taking advantage of the fact that the points

of the Keyframe are accessed sequentially, and that in this part of the processing all accesses are

going to be in an immediate neighbourhood of a 2-pixel radius, multiple accesses to the same

value are mapped to registers instead of going to the cache. To enable the implementation of

this register window, four additional row buffers are used, each of size equal to the image width.

The sliding window itself is implemented with an array of 25 registers, with the current target

stored in the centre. A diagram of this configuration is presented in Fig. 5.4.
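A behavioural sketch of this buffering scheme is given below. Sizes and names are illustrative, and the real design is a hardware pipeline rather than software, but the mechanics are the same: four row buffers of image width hold the previous rows, and a 5x5 register window shifts one column per incoming pixel.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int W = 8;  // image width, kept small for demonstration

struct SlidingWindow {
    std::array<std::array<uint8_t, W>, 4> rows{};  // four row buffers
    std::array<std::array<uint8_t, 5>, 5> win{};   // 5x5 register window
    int col = 0;

    void push(uint8_t pixel) {
        for (int r = 0; r < 5; ++r)                 // shift window left
            for (int c = 0; c < 4; ++c) win[r][c] = win[r][c + 1];
        for (int r = 0; r < 4; ++r) win[r][4] = rows[r][col];
        win[4][4] = pixel;                          // newest pixel, bottom-right
        for (int r = 0; r < 3; ++r) rows[r][col] = rows[r + 1][col];
        rows[3][col] = pixel;                       // rotate row buffers
        col = (col + 1) % W;
    }
};

// Stream an 8x8 ramp image (value = y*8 + x) up to and including pixel
// (4, 4): the window centre win[2][2] then holds the pixel at (2, 2).
inline SlidingWindow demo_window() {
    SlidingWindow sw;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x) {
            if (y > 4 || (y == 4 && x > 4)) return sw;
            sw.push(static_cast<uint8_t>(y * 8 + x));
        }
    return sw;
}
```

With this arrangement, every pixel in the 2-pixel neighbourhood of the current target is read from registers, never from the cache.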

As this scanning happens, these registers are used by the “Keypoint and Gradient Check” unit,

which is responsible for the on-the-fly calculation of the maximum gradient in a neighbourhood

of the pixel. First the gradients in the two image dimensions, horizontal (denoted with the

variable x here) and vertical (denoted with the variable y), are calculated for the set of five

pixels. These are the target pixel, the ones immediately left and right of it, and the ones directly

above and below. Following that, a gradient magnitude is calculated for these five pixels as the

134 Chapter 5. Accelerating Mapping for SLAM

Figure 5.3: Sliding window over current keyframe

Figure 5.4: Sliding window utilizes shift registers and 4 row buffers (Row 0 to Row 3), fed by the Keyframe pixel stream

Figure 5.5: The intensity gradient is calculated in the two image directions, for the target pixel and its four immediate neighbours, resulting in a total of 13 accesses, 20 gradient calculations and a final reduce operation to find the maximum value in the region:

dx(x, y) = I(x + 1, y) − I(x − 1, y)
dy(x, y) = I(x, y + 1) − I(x, y − 1)
grad(x, y) = √(dx² + dy²)

square root of the sum of the squared horizontal and vertical gradients, √(dx² + dy²), as shown in

Fig. 5.5. This represents a total of 20 accesses completely served by the 5x5 register window,

which results in lower latency and an improved performance over direct cache access. Finally,

the maximum of these values is calculated and compared to the threshold to result in a pass or

fail condition.
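The gradient-check computation just described can be sketched as follows; the accessor `I` stands in for the 5x5 register window and all names are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Central differences in x and y for the target pixel and its four
// immediate neighbours, gradient magnitude sqrt(dx^2 + dy^2) for each,
// and a reduce to the maximum, which is compared to the threshold.
template <typename Img>
float max_gradient(const Img& I, int x, int y) {
    const int offs[5][2] = {{0, 0}, {-1, 0}, {1, 0}, {0, -1}, {0, 1}};
    float best = 0.0f;
    for (const auto& o : offs) {
        const int px = x + o[0], py = y + o[1];
        const float dx = I(px + 1, py) - I(px - 1, py);  // horizontal
        const float dy = I(px, py + 1) - I(px, py - 1);  // vertical
        best = std::max(best, std::sqrt(dx * dx + dy * dy));
    }
    return best;
}

// Demo accessor: a horizontal ramp I(x, y) = 2x, whose gradient
// magnitude is exactly 4 everywhere.
inline float ramp(int x, int /*y*/) { return 2.0f * static_cast<float>(x); }
```

In the accelerator, the twenty subtractions and the square roots are scheduled onto a shared array of pipelined floating-point units, as described next, rather than evaluated sequentially as here.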

In Fig. 5.5, a cross pattern of reads is employed. For each maximum gradient calculation,

20 sets of subtractions are scheduled on an array of floating-point pipelined adder units and

then the results are squared and accumulated at a square root hardware block. The whole

chain of operations is pipelined and allocated to a time-shared array of adders and multiply-

accumulators to match the desired processing rate, a process that will be discussed further in

the next subsection.

Based on the gradient threshold for the area and the pixel’s confidence rating from the metrics

stored in the variables Blacklisted and Validity score, the Keypoint’s fitness is calculated as a

candidate to try to map. It is then forwarded to the “Epipolar Line and 5-Point Unit” that is

responsible for calculating the epipolar line’s equation and its overlap with the camera frame

to determine the scan range, centre and steps, as in the discussion of Section 2.6, demonstrated

here in Fig. 5.6.


Figure 5.6: Epipolar geometry, with the epipolar line depicted in orange. While the point will lie on the line, it does not have to appear on the camera’s frame, as it can lie outside that plane.

This unit is firstly characterised by more complicated control, since several operations will

be skipped or repeated depending on the values during runtime, leading to the necessity of

some extra hardware units allocated for the worst case scenario. For example, an epipolar

line segment that is too short in the confidence interval will have to be “padded” on the fly

and, if the starting point is outside the frame, it will be moved toward the border and the

whole line segment will be re-tested. It also contains several floating point operations, including

multiplications and divisions to calculate and check the robustness of the search, including the

epipolar line angle, the portion inside the camera view and the length of that section. Finally,

this unit calculates the 5-point SSD pattern, as discussed in Section 2.6, that will be compared

later in the loop processing unit to find the best match along the epipolar line.

To optimise this for a hardware pipeline, some conditional execution paths were simplified with

some logic checks triggering various flag bits, propagated along the pipeline to reduce resource

usage. The one exception that adds complexity is a computation path where, if the epipolar

line has one side of it partially outside the frame, that point of the segment will be moved


inside the frame and the segment will have some checks repeated. This comes with dedicated

hardware resources, but was kept as part of the effort to maintain equivalence with the software

method in hardware. If all the checks are valid, at the end this information is forwarded to

the next section of the pipeline, together with a pre-calculation of the scan iterations and steps

and the five points to scan for. If it fails, the map point is still forwarded to be used for later

processing such as filtering, together with flags to mark this decision and the reason for failure,

but it will not trigger a scan.

The next group of hardware blocks is a collection of units that will be referred to as the fast-rate

pipeline. This part of the proposed architecture aims to efficiently map the central part of the

algorithm, which contains a dynamic number of iterations per point ranging from 1 to tens,

to a buffered unit that will receive Keypoints at a rate matching the rest of the pipeline but

perform the necessary processing at a faster rate to account for the varied duration of these

dynamic iterations. In the fast-rate pipeline, the necessary cache accesses and error calculations

for the scanning of the epipolar line and the selection of the two best candidate locations are

performed, at a faster rate than the rest of the hardware. The thicker lines at the centre of the

units describe the streams for the information pertaining to this scan, shown in Fig. 5.2, while

the thinner lines correspond to the map point together with its metadata being forwarded,

coming from the units outside.

In this faster rate pipeline, the “Generate Scan Points” unit supplies a steady stream of pixel

locations to be fetched from the cache unit, according to the calculations in the Epipolar line

unit. This unit essentially acts as a bridge between the two rates of processing. One side of it

reads available target points (that are either ready to be processed, or turned out to fail some

condition at some step of the pre-processing). For the ones that will require a scan, part of

the information read is the start, finish and step size of the scan. Using this, a second pipeline

is initialised inside this unit for a predetermined number of iterations. It writes out a rapid

succession of coordinates for the points to be scanned, that are forwarded to the cache request

handler. Meanwhile the other metadata is also passed along a slower line through the Cache

Request and Subpixel Intensity Calculation units, through to the Loop Processing Unit that

has a dual pipeline inside as well.


Using the points coming from the previous unit, the “Cache Request Handler” fetches these

pixels from the caches and forwards them to the “Subpixel Intensity Calculation” unit where

linear interpolation is performed in a neighbourhood of 4 pixels around the floating point

coordinates. This separation of functionality is deliberate. By separating the calculation of

pixel coordinates, for a number of iterations that are unknown by the unit in its functionality,

but come as just-in-time information from previous units, it is possible to separate the access

address to the actual memory requests by buffers, and then optimise separately. This has the

effect of allowing simpler, smaller units that, as soon as their buffers start filling up, ramp up to a sustained access rate, operating at a high utilization rate and offering high performance

with a comparatively lower design complexity.
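The subpixel intensity calculation mentioned above amounts to bilinear interpolation over the 2x2 pixel neighbourhood surrounding a floating-point scan coordinate; a sketch, with an illustrative accessor `I`, is:

```cpp
#include <cassert>
#include <cmath>

// Bilinear interpolation: blend the four pixels around (x, y) with
// weights given by the fractional parts of the coordinate.
template <typename Img>
float subpixel_intensity(const Img& I, float x, float y) {
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const float fx = x - x0, fy = y - y0;  // fractional weights in [0, 1)
    return (1 - fx) * (1 - fy) * I(x0,     y0)
         +      fx  * (1 - fy) * I(x0 + 1, y0)
         + (1 - fx) *      fy  * I(x0,     y0 + 1)
         +      fx  *      fy  * I(x0 + 1, y0 + 1);
}

// Demo accessor: a plane I(x, y) = x + 10y, which bilinear interpolation
// reproduces exactly at any subpixel coordinate.
inline float plane(int x, int y) { return static_cast<float>(x + 10 * y); }
```

The four pixel reads map directly onto the 2x2 access window served by the frame cache, which is why the cache controller can remain simple.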

All these streams are passed on to the “Loop Processing Unit” (LPU) that performs the core of

the scanning algorithm, the functionality of which is demonstrated in Fig. 5.7. On the left side

of the figure, the successive scanning steps (overlapping sets of 5 comparisons) are illustrated

as white dots on the epipolar line. The matching error on the right side of the figure does

not have units, as the error is a relative value, produced from the sum of squared errors in

pixel intensity. The horizontal axis represents scan steps, the stride of which is relative to the

baseline and resolution for the scan.

The LPU block first reconstructs the pattern of 5 pixels that are scanned for, using flags to

identify the type of data it is receiving, as the previous units are intentionally agnostic of the

algorithm stage they are operating for. It then performs the scan steps to find the position

with the minimum sum of squared errors. It stores internally the best match and the second

best match. It also maintains additional information regarding the search. This includes the

steps performed, the distance of the search and the matching error. As we can see in Fig. 5.7,

in successful matches there is usually one strong minimum, with an error function that appears

relatively convex. To estimate if a best match location indicates a successful match or a false

positive, a number of metrics are checked, including the magnitude of the error and, when the two best candidates are more than a step’s width apart, whether their errors differ by more than a certain factor (for example 1.5×).
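The bookkeeping this implies can be sketched as follows; the 1.5× factor matches the example in the text, while the structure and names are illustrative assumptions:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>

// As SSD errors stream in, one per scan step, keep the best and
// second-best positions, then apply a simple distinctiveness test.
struct MatchTracker {
    float best_err   = std::numeric_limits<float>::infinity();
    float second_err = std::numeric_limits<float>::infinity();
    int   best_step = -1, second_step = -1;

    void observe(int step, float err) {
        if (err < best_err) {
            second_err = best_err; second_step = best_step;
            best_err = err;        best_step = step;
        } else if (err < second_err) {
            second_err = err; second_step = step;
        }
    }

    // Reject if the error is too large, or if a distant runner-up is not
    // clearly worse than the best match (no single strong minimum).
    bool accepted(float max_err, float factor = 1.5f) const {
        if (best_err > max_err) return false;
        if (std::abs(best_step - second_step) > 1 &&
            second_err < factor * best_err) return false;
        return true;
    }
};
```

A convex error profile such as Fig. 5.7 passes this test, while the two comparable distant minima of a repeating texture, as in Fig. 5.8, fail it.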


Figure 5.7: For each comparison, the previous four values are re-used and a new one is calculated by interpolating the values of the four pixels surrounding the floating-point coordinates of the next scan point

On a match considered successful, a linear interpolation step is added that attempts to further

optimise the solutions by interpolating between the two best discrete error positions (the func-

tion in the figure is not continuous but rather a set of samples). In other cases, especially in

larger baselines, repeating patterns or other sources of error (e.g. occluded parts of the scene)

will lead to multiple local minima and potentially false matches. For example a pattern such

as the one in Fig. 5.8, with its repeating texture, will lead to multiple local minima with com-

parable errors. The distance between the two best matches and the magnitude of the matching

error can be employed as discussed in the previous paragraph to potentially reject this obser-

vation as invalid. To improve the accuracy of the algorithm, different heuristics such as this

are employed to attempt to detect and remove certain sources of error, such as “blacklisting”

points if they repeatedly fail to be tracked successfully.

After a scan is completed, this is forwarded to the “New depth Calculation” unit, that continues

the processing of the mapped Keypoint at the slower rate of the rest of the pipeline. This

calculates a new depth and depth variance value based on the results of the LPU as described

in the previous paragraph, which the next unit, “Subpixel Stereo”, can further refine when the conditions are right for a sub-pixel disparity search.


Figure 5.8: In this case, intensity information, combined with a large baseline in the absence of a strong previous estimate, is insufficient to provide a good match.

Finally, this information is streamed to the “Depth Integration” and the Filter units. The first

is responsible for putting all the information together for each map point, integrating results

from the two previous units and updating the metadata of the point, described at the beginning

of the section, as necessary, including a revised confidence rating. The filter units perform

regularization operations. The first one, upon finding sufficient confidence in observations in

a window around a pixel that lacks an observation itself, can add an estimate for it with a

weighted average of its valid neighbours.
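A sketch of this hole-filling step is given below; the 3x3 window, the inverse-variance weighting and all names are assumptions made for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Obs { float inv_depth; float var; bool valid; };

// A pixel with no observation of its own receives an estimate as a
// weighted average of its valid neighbours, given enough support.
bool fill_hole(const std::vector<std::vector<Obs>>& m, int x, int y,
               int min_support, Obs& out) {
    if (m[y][x].valid) return false;  // only fill missing observations
    float wsum = 0.0f, dsum = 0.0f;
    int support = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            const Obs& n = m[y + dy][x + dx];
            if (!n.valid) continue;
            const float w = 1.0f / n.var;  // confident neighbours weigh more
            wsum += w; dsum += w * n.inv_depth; ++support;
        }
    if (support < min_support) return false;
    out = {dsum / wsum, 1.0f / wsum, true};
    return true;
}

// Demo: a 3x3 patch whose centre is missing and whose eight neighbours
// all observe inverse depth 2.0 with variance 1.0.
inline bool demo_fill(Obs& out) {
    std::vector<std::vector<Obs>> m(3, std::vector<Obs>(3, Obs{2.0f, 1.0f, true}));
    m[1][1].valid = false;
    return fill_hole(m, 1, 1, 4, out);
}
```

The same row-buffered sliding window used elsewhere in the pipeline supplies the neighbourhood, so the filter adds only a fixed priming latency.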

The second filter, “Depth regularize filter” calculates a smoothed value for the depth and

variance of valid map points, stored separately to the actual depth, again operating on a

sliding window around a centre pixel. For both filter units, row buffers allow region of interest

processing, without losing the efficiency of the streaming interface, but simply adding one time

latency to the total latency of the accelerator. For both cases there is buffer priming before

the first data point even arrives, and the total latency can be well estimated as the amount

of time to receive 4 rows of the Keyframe, at less than 1% of the total running time for a

vertical resolution of 480 pixel rows. After the processing and filtering finishes, the operations

performed at the input are reversed in a pack-and-output unit that streams the points of the

Keyframe out to the off-chip DRAM utilizing burst write transactions.


5.3.3 Multi-rate dataflow operation

Semi-dense SLAM is characterised by a large amount of data that needs to be processed. For

a map with a resolution of 640 × 480, typical for the state of the art, the depth map representation

will take up 7.37MBytes. That is in addition to the actual frame size of 307KBytes. To put

that into perspective, in order to process 60 frames per second as they come from a camera

and extract their depth information, the total time between captured frames is less than 17ms,

but that amount of data requires approximately half that time just to be read from memory

using a dedicated port on the reconfigurable fabric at the typical memory bandwidth available

on off-the-shelf FPGA-SoCs. To keep up with that latency, at a typical frequency achievable

on an HLS-designed FPGA accelerator of 100 MHz, it would be necessary to process one map

point every 6 cycles on average.

For devices of the Zynq family targeted throughout this thesis, the maximum bandwidth of the

memory controller is 4.2GB/s as presented in the Xilinx Zynq Technical Reference Manual [76]

but with a real achievable bandwidth depending on the type of access (efficiency of sequential

reads/writes will be higher than random accesses) and reported for this controller in the same

TRM in the region of 80-90%. On the FPGA side, there are 4 HP-ports, as discussed here

and in Chapter 2, that can handle a maximum of one 64-bit request per cycle, at a maximum

frequency of 150 MHz for a theoretical maximum of 1200MB/s per port. For a typical design

that can generate requests in two HP ports, at 100 MHz, the achievable bandwidth for sequential

reads/writes will be in the region of 1.6GB/s, which means an 8MByte sequential read will need

approximately 5 ms to complete.
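These budget figures can be checked with simple arithmetic; the constants below restate the assumptions from the text (a 100 MHz accelerator clock, a 60 fps target, 640x480 map points, and two 64-bit HP ports each accepting one request per cycle):

```cpp
#include <cassert>

constexpr double kClockHz        = 100e6;
constexpr double kCyclesPerFrame = kClockHz / 60.0;            // ~1.67M cycles in <17 ms
constexpr double kPoints         = 640.0 * 480.0;
constexpr double kCyclesPerPoint = kCyclesPerFrame / kPoints;  // ~5.4: one point every ~6 cycles
constexpr double kBytesPerSec    = 2 * 8.0 * kClockHz;         // 2 ports x 8 B/cycle = 1.6 GB/s
constexpr double kReadTimeMs     = 8e6 / kBytesPerSec * 1e3;   // 8 MB sequential read: ~5 ms
```

The first figure sets the average processing rate the pipeline must sustain, and the second shows that memory traffic alone consumes a substantial fraction of the frame interval.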

A straightforward implementation would be to design units capable of performing a full epipolar

line scan in that time. However, the average computational cost for a scan is three times less

than the peak computational cost. Simultaneously, for most rows the majority of Keypoints

will not require an epipolar line scan. If a design targets a fixed latency of 6 cycles for the worst-

case load, it will result in an underutilized and less power-efficient accelerator with significantly

higher resource utilization.


Alternatively, certain properties of semi-dense SLAM can be leveraged to design a more efficient solution. An epipolar line scan is often not required when the point does not currently contain

a valid observation or is not visible in the current frame. Moreover, in confident observations,

it can be safely reduced to the region d ± 2σd, as described in Section 2.6 of the Background

Chapter. The designed coprocessor takes advantage of the pattern and frequency of the afore-

mentioned cases by utilizing fully pipelined units, each designed to efficiently execute a part

of the computation of the entire algorithm, as discussed in Section 5.3.2. The units are sepa-

rated by large buffers, to not propagate the variable processing delay backwards through the

pipelines. They are also designed with streaming dataflow in mind, allowing a design to be-

come rate-based and be decoupled in a large part from the possible high processing latency of

a particularly busy neighbourhood of pixels.

The most efficient design was found to be self-contained, deeply-pipelined hardware blocks that

perform the different types of operations on a static schedule, re-using a number of compute

units to achieve a certain rate of processing per cycle. The input and output of these pipelines

is a streaming buffer that allows a unit to stall until the next data point is available.

At the same time, different parts of the algorithm can be overlapped in the same hardware

units, re-using some of the hardware for different operations, everything being designed with

the principle of data always moving forward. The pipelines contain multiple compute units for

multiplication, addition and division, and logic and multiplexers shift the structure of the unit

as necessary. This way they can change from an initialization phase, to operating on points,

to scanning across the frame cache, depending on the unit, or skipping a scan and forwarding

metadata to the next unit in the pipeline.

The units were also designed to operate at different rates, with fast-rate processing units in

the fast-rate pipeline to access the frame cache, perform epipolar scans and find the two best

matches, and more relaxed processing at the surrounding stages. The on-chip cache access hap-

pens at a rate of one buffered access window (2x2 pixels in the implemented design) per cycle, allowing a simple and highly efficient cache controller. The units are connected to

each other through large streaming FIFO buffers that allow communication to happen asyn-

chronously, and hide penalties on latency that would arise from the variable processing rate


design. In this way this architecture achieves a higher performance level for the size of the

targeted reconfigurable devices, using resources more efficiently than a pipeline designed to

operate at a rate targeting the worst case scenario for the scanning latency.
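The benefit of buffering between units of different rates can be seen in a toy model: a slow stage emits one job every few cycles, the fast-rate stage consumes one scan step per cycle, and the FIFO in between absorbs bursts so the slow stage never observes the variable scan latency, as long as the average load stays under capacity. Names and the workload are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Cycle-accurate toy simulation of a two-rate producer/consumer pair
// decoupled by a FIFO. Each job takes at least one step.
int simulate(const std::vector<int>& steps_per_job, int emit_interval,
             std::size_t& max_depth) {
    std::queue<int> fifo;
    int cycle = 0, busy = 0;
    std::size_t emitted = 0, done = 0;
    max_depth = 0;
    while (done < steps_per_job.size()) {
        if (emitted < steps_per_job.size() && cycle % emit_interval == 0)
            fifo.push(steps_per_job[emitted++]);
        if (busy == 0 && !fifo.empty()) { busy = fifo.front(); fifo.pop(); }
        if (busy > 0 && --busy == 0) ++done;
        max_depth = std::max(max_depth, fifo.size());
        ++cycle;
    }
    return cycle;  // cycles to drain the whole workload
}
```

For a workload of one 11-step scan followed by three one-cycle skips, emitted every 5 cycles, the FIFO never grows beyond two entries and the total running time stays bounded by the emission rate rather than by the burst.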

As shown in Fig. 5.2, the units in the centre of the pipeline inside the orange box operate at

a faster rate than the rest. Focusing on the region of the fast-rate pipeline as demonstrated

in Fig. 5.9, there are three modes of operation for the units at the sides. When an epipolar

line scan and depth update is necessary, they perform initialisation steps and then set up the

rest of the control to perform a scan. In the second mode, they simply generate one scan step

per cycle. Finally, if there is a point that does not require a depth update, they just directly

forward that point’s associated metadata to the next unit in a single cycle. They essentially

act as a dual-rate interface between the flow of data through the rest of the algorithm and the

scan steps happening inside.

Figure 5.9: The units in the fast-rate pipeline (Generate Scan Points, the Cache Request Handler, Subpixel Intensity Calculation and the Loop Processing Unit) operate at two rates simultaneously, with some control processes, initialization and communication with the rest of the pipeline happening only when starting or finishing a scan, and the main compute units operating at a rate of one scan step per cycle.

As such, part of the pipeline in Generate Scan points reads a set of values from the Epipolar

Line unit to initialise the counters and control for the rest of this unit. In a similar fashion, the

LPU main unit processes scan steps but needs to be aware of the parameters for the current

scan, indicating when to initialize a new scan or integrate and forward the final results to the New

Depth Calculation Unit. The metadata that guide this process pass along buffers every time

a new scan starts, indicated in Fig. 5.9 by the thinner arrows between the units, while the


higher-rate information dealing with the scan passes through separate buffers indicated with

the thicker lines in the figure. The units in the middle, Cache Request Handler and Subpixel

Intensity Calculation, do not have to be aware of the part of the process they are serving. This

is enabled by taking advantage of the symmetry of initialization accesses and scan accesses, so

that these units simply process data in the same way and pass them forward.

As has been mentioned, this architecture is scalable in terms of compute units allocated for the

hardware blocks, a process guided by the desired processing rate inside the fast-rate pipeline

and for the rest of the hardware. By reducing the amount of multiplier, divider and accumula-

tion units built in each hardware block and time sharing them more aggressively for different

operations, it is possible to increase the amount of cycles necessary for a scan but with an

almost linear decrease in resources for that unit. The most efficient designs must have units

in the fast-rate pipeline matched with each other, as otherwise the slowest one would dictate the rate of processing, leaving the rest of the units underutilized. In a similar fashion, the rate

of processing for the units before the fast-rate pipeline should be tuned as one number, and

the same or slightly slower processing rate should be targeted for the units after the fast-rate

pipeline. In essence, these two processing rates trade off between targeted

performance and resources. The resulting architecture therefore can scale to different FPGA

devices and resource budgets. In Section 5.4 different example design points are presented,

which were achieved by changing the target processing rates as described previously.

5.3.4 Performance Analysis

To explore the hardware rates, as described in Section 5.3.3, that maximise the achievable

performance given the available hardware resources and verify that the design assumptions

discussed so far hold when running with real-world datasets, monitoring instrumentation was

added in the software version of LSD-SLAM and it was executed for the entire duration of the

datasets used in the evaluation section. Firstly, statistics were collected regarding the average

processing load that is expected for each iteration of a map update across all the mapped

frames. Then more detailed samples were acquired to study the distribution and extrema of


the amount of epipolar line scan steps per map point.

Figure 5.10: Heatmap of Depth Map valid points for epipolar scan. Axes represent image coordinates, with colour representing the frequency of a Keypoint in those coordinates requiring a scan

A subset of this data is visualised on the heatmap of Fig. 5.10. Its dimensions are equal to

the Keyframe image size used in the tested datasets and a typical LSD-SLAM implementation.

Each cell corresponds to a pixel of the Keyframe, and the colour indicates the frequency (0-1)

that that pixel would require an epipolar line search of any length for the duration of an entire

dataset. We can see that the frequency of use of any particular point rarely reaches the 30%

mark, and for most cases it is nearer to 5-15%. This figure is also interesting because it reveals

the tendency of the algorithm to focus on repeating patterns; in this case it is a particularly

busy texture that is tracked in different positions in different Keyframes. It also demonstrates

an intuitive fact, that points of interest will concentrate more on the middle three-quarters

of the Keyframe and less at the top and bottom. This is expected, as to repeatedly map a

point successfully it needs to appear in successive Keyframes, something which has a lower


probability of happening at the top and bottom borders of the camera, but can happen at the

left and right borders as the camera in the selected datasets pans more often left or right than

up or down.

Sampling for peak loads across a dataset revealed that the number of points per line that require

scanning peaks around the centre of the image at a frequency averaging at 18% of points in

a single row. By looking for extrema some outlier cases were discovered, which however were

usually less than 1-2% of the frames processed. Those have to do with special cases consisting

of initialisation steps or very sharp motions. However, the worst case scenario will always

have an upper bound, found by assuming all buffers are full. That latency ceiling has a linear

relationship with the fast-rate pipeline processing rate and the processing load per frame, as in

that case the rest of the pipeline has to wait for every scan to finish before forwarding the next

point. Hence, this behaviour can be predicted and designed against.

The average epipolar line scan length was calculated at 11 steps for the tested datasets. Given

the results in Fig. 5.10, if 25% of the points in a line require an epipolar scan, and the other 75%

are skipped in one cycle, the total cycles per row would be 2240 cycles, or 3.5 cycles per point

for an average of 11 scan steps for the scan (fast-rate) pipeline. In the implemented unit, the

design choice was a processing rate of one scan/interpolation per cycle in the fast-rate pipeline,

matching the figures above, and a processing rate of one target point per 5 cycles in all the

other units, which targets an overall throughput of more than 62 frames per second for a resolution

of 640x480. This leaves a good margin of safety to absorb the latency of thousands of extra

scans above the typical values without dropping below the target framerate. It was found that

for the datasets tested, this relationship of 5-to-1 was a good ratio for the processing rates of

the pipelines with performance being very close to the theoretical maximum for the majority

of map updates.
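The rate calculation above can be restated directly; the constants are the figures assumed in the text (640 points per row, 25% of them requiring an epipolar scan of 11 steps on average, the rest skipped in a single cycle):

```cpp
#include <cassert>

constexpr int    kRowPoints    = 640;
constexpr double kScanFraction = 0.25;
constexpr int    kAvgScanSteps = 11;
constexpr double kRowCycles    = kRowPoints * kScanFraction * kAvgScanSteps
                               + kRowPoints * (1.0 - kScanFraction);  // 2240 cycles per row
constexpr double kAvgCyclesPerPoint = kRowCycles / kRowPoints;        // 3.5 cycles per point
```

The implemented 5-to-1 ratio between the outer pipeline and the fast-rate pipeline therefore leaves roughly 1.5 cycles of slack per point to absorb loads above the average.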

As we will discuss in the evaluation, in real-world testing the pipeline did indeed perform as

expected, with less than 1 update in a thousand deviating significantly from the performance

targeted. Moreover, in actual tests the software version on both tested platforms had a worse

behaviour in such cases with an increase of almost 200% in the processing time for some frames.


These results are discussed in more detail in Section 5.4. Thus, this work, besides delivering

the promised performance and performance-to-watt figures, delivers a much more predictable

performance, which guarantees far fewer dropped frames and map updates than a software platform,

meaning a higher quality result for SLAM when operating in real-time, in real-world conditions.

Lastly, the proposed architecture is tunable and can be changed to adapt to different application

requirements. One option is to increase the capabilities of the fast-rate pipeline to have the

system guarantee a very small performance degradation, even in outlier cases, at the cost of

some underutilized resources. Alternatively, if the application allows, one can go the other way

and under-provision the fast-rate pipeline to target a more resource-and-power efficient system,

by allowing some degradation of a few percentage points in more cluttered scenes. In Fig. 5.11

of the Evaluation, we will see the scaling to target different performance points. The 32.5 fps

and 42.5 fps designs are examples of design points where an extra cost in resources guarantees a lower

maximum latency, and therefore a higher target performance.

5.4 Evaluation

5.4.1 Experimental Setup

In this chapter, all of the experiments were conducted on the Xilinx ZC706 Evaluation board.

The board comes with a Zynq-7045 FPGA-SoC [76] and two DDR memories, each 1GB of

DRAM, of which only one was accessible to both the FPGA and to the mobile CPU and is

the one used in these experiments. The SoC contains a dual-core ARM Cortex-A9 clocked at

667 MHz and buses connecting the CPU to off-chip memory, and directly to the FPGA fabric.

On board, as is the case for the experiments conducted in Chapter 4, the mobile dual-core

CPU runs an Ubuntu distribution of the Linux Operating System. LSD-SLAM was compiled

and tested as software, and then the core functionality of the tracking and mapping tasks was

replaced with calls to the accelerator unit through userspace drivers that were designed as part

of this work, using a direct slave interface from FPGA to CPU set up through the Xilinx Vivado


toolchain.

5.4.2 Benchmark selection and Platforms

The benchmarks used for the design exploration and the performance comparison of this work were the Room and Machine Hall trajectories supplied by LSD-SLAM's authors on TUM's website: https://vision.in.tum.de/research/vslam/lsdslam. These were run on three platforms, in a non-real-time setup where the tracking and mapping tasks were run separately, on a step-by-step basis, once for each frame, to maximise the software performance on general-purpose hardware and to avoid the experimental results being affected by factors other than the

algorithm and hardware’s performance. The first general purpose hardware platform is a high-

end desktop machine, with 16GB of DDR3 RAM and an Intel Core-i7 4770 CPU with Turbo

mode enabled running at a sustained 3.77 GHz with all cores loaded executing the software. The

second one, to provide a state-of-the-art comparison point for mobile/embedded platforms, was

an Nvidia Tegra X1 board, utilizing the on-board ARM Cortex-A57 CPU clocked at 1.73 GHz

alongside 4GB of onboard RAM.

5.4.3 Design Implementation and Resource Usage

The architecture described in the previous section is designed to be platform-agnostic and optimised for resource usage. Nevertheless, the use of Vivado HLS tools drove a number of

implementation decisions in order to develop and test the IP on the target FPGA-SoC, leading

to certain overheads1. For evaluation, the design was synthesized and placed-and-routed with

Vivado HLS and Vivado Design Suite (v2018.2), targeting a Xilinx Zynq ZC706 board, and

run and tested on the same board. For the parameters described in Section 5.3, timing was

met for the coprocessor at 125 MHz. The resource usage for that result, post-implementation,

1For example, since the tool always rounds up memory size to the next power of two for BRAM utilization, a choice was made to partition in two dimensions cyclically by a factor of 5, a non-power-of-two factor. This significantly reduced the memory overheads, eventually fitting both accelerators on the same device in the highest-performance configuration, at the cost of increased DSP and LUT usage.


Resource   Map Unit   Track and Map Units   Available on Z-7045
LUT        151,674    184,993               218,600
LUTRAM     12,242     15,317                70,400
FF         213,761    256,665               437,200
BRAM       958        1089                  1090
DSP        594        718                   900

Table 5.1: Resources post-implementation

is reported in Table 5.1. This design was also combined with the design from [19], and both units were successfully tested working side by side with the target frequency set to 100 MHz.

Fig. 5.11 demonstrates the resource-to-performance scaling, post-implementation, for the presented design. The graph contains examples ranging from a lower performance point up to the 60 fps design point, which was selected as the best performance candidate that still allows a second accelerator to fit on the larger ZC706 board. To raise the performance from 30 fps to 60

fps, the main change is an increase in processing units, doubling the number of Keypoints we

can process per cycle. As discussed in Section 2.4 under computational cost, this task has a complexity of O(n), scaling with the number of Keypoints processed. So, having tuned the

rest of the parameters (gradient threshold, resolution) for a good accuracy and robustness level

for our dataset, there is an average number of Keypoints to process, and doubling the number

of Keypoints processed per unit of time will give close to double the performance.

It should be noted for Fig. 5.11 that, while scaling from 15 to 30 to 60 fps is obtained by increasing processing units until, on average, twice the Keypoints can be processed per second, the 32.5 and 42.5 fps design points reflect a different goal. Due to the nature of the mapping function, as discussed in previous sections, we allow some increase in latency to achieve a better resource-to-framerate ratio. These two design points instead increase many intermediate resources to guarantee a better worst-case behaviour. This requires a significant increase in processing units and registers but offers only slightly better performance in the average case.


[Figure: bar chart of the resource usage ratio (LUTs, FFs, DSPs) relative to the Zynq-7045, for target performance points of 15, 30, 32.5, 42.5 and 60 fps at 100 MHz.]

Figure 5.11: Resource scaling with architectural tuning targeting 100 MHz

We can see in the figure that different resources scale at different rates. This is because the scaling so far targets only the placement of compute units, guided by the integer initiation interval in High-Level Synthesis, while the interfaces to memory, cache sizes and lines, and inter-block communication are for now assumed static. As such, resources such as the DSPs, which are ubiquitous in most compute units, scale almost linearly with the target performance, followed by the LUTs. On the other hand, Flip-Flops have a fixed offset cost owing to their extensive use in I/O and other fixed registers, and their change in utilization reflects the more relaxed pipelining when targeting lower processing rates.

5.4.4 Performance and Power Comparison

In real-world testing with two datasets from TUM (Room and Machine Hall)2, 98% of frames

were within one millisecond of the target processing time and more than 99% were within two.

The accelerator achieved a performance of more than 60 frames per second, on-par with a

hand-optimised, multi-threaded implementation on a high-end desktop CPU. There were some

2Room and Machine Hall datasets supplied by the authors of LSD-SLAM on TUM's website: https://vision.in.tum.de/research/vslam/lsdslam


outlier cases with a performance drop of up to 30%, from 16.3 ms to 21 ms. For example, in the Machine Hall dataset, one of the two depicted in the Evaluation section, out of 3268 map updates only 19 experienced a significant delay, many of them in the initialisation phase, with a maximum recorded value of 20.9 ms. However, that is considered acceptable in this application

for two reasons. Firstly, even in the case of accumulating a delay for successive updates, the

application can dismiss one dropped map update out of one thousand without a noticeable

degradation and can handle a lower mapping rate than the one targeted with the presented

design.

Secondly, as has been mentioned, it delivers performance far more stable than the software versions. In Fig. 5.12, we can see a violin plot of the mapping performance (total processing

time for a map update step) on three high-end platforms across the two datasets. In this plot,

the colour corresponds to the platform, from left to right an Intel i7-4770 at 3.77 GHz, this work

at 100 MHz on a Zynq-7045 and the Cortex-A57 on a Tegra TX1 running at 1.73 GHz. The

width of the shape corresponds to the density of observations around a particular millisecond value,

similar to a sideways kernel density plot. The thicker, white line in the middle corresponds to

the mean value of the observations, while the thinner orange one corresponds to the median. Finally, the lines at the top and bottom are the actual minimum and maximum values observed.

The figure demonstrates the variability of this processing load on general purpose hardware,

and how robust the proposed mapping accelerator is to these delays, appearing almost flat since

most of the observations were very close to the ideal value of approximately 16.2 ms at 100 MHz.

The reasons for this concern two main aspects of computation and communication: how much of the available bandwidth each platform uses effectively, and how well memory transfers are overlapped with computation. As this is a memory-intensive application, cache misses and other memory

with computation. As this is a memory intensive application, cache misses and other memory

inefficiencies will cause significant delays to a general purpose CPU platform, especially in light

of multiple threads running at the same time and polluting the cache for each other. This effect,

as can be verified from the test data in Fig. 5.12, will be exaggerated in the mobile platform

with smaller and simpler cache memories and predictors and smaller out-of-order execution

capabilities.


[Figure: violin plots of map-update processing time for the Room and Machine Hall datasets on each platform: PC, this work, and Tegra TX1.]

Figure 5.12: Mapping performance in msecs - Different Platforms / Datasets

In contrast, the hardware architecture presented in this chapter is designed specifically for this

application and takes advantage of the high bandwidth memory interface available, scheduling

reads at the best efficiency possible and hiding most memory latencies, while also caching the

randomly-accessed data in advance. It also overlaps computation and communication very

effectively and, though operating at a lower frequency than both CPUs, it offers orders of

magnitude more compute units than a general purpose processor, performing significantly more

operations per cycle. In this way, its performance does not depend strictly on the number of

points to process, unless their concentration and number significantly exceeds what is expected.

On the CPU, performance varies considerably on a frame-by-frame basis. Simple cases, with few points to scan for and with those points concentrated in one area, can give better performance simply from fewer net CPU instructions and fewer cache misses. On the other hand, frames with many points increase the CPU instruction count linearly, and their spread may tax the memory subsystem more, leading to a large increase in computation time.

In addition to the performance, power consumption samples were collected for each platform,


[Figure: power consumption in watts during execution, measured at the wall, split into static/idle board power and dynamic runtime power, for the i7-4770 @ 3.77 GHz, the Tegra TX1 (Cortex-A57 @ 1.73 GHz), this work @ 100 MHz, and this work @ 50 MHz.]

Figure 5.13: Power consumption of the devices tested. Here, “This work” refers to the combined power of the tracking and mapping accelerators and the CPU operating for the background tasks of SLAM.

using the current draw from the socket, so it includes all the peripherals plus the power supply

losses. The power results were collected while executing the full SLAM algorithm, so in the case of the FPGA-SoC they include both the mapping accelerator and a version of the accelerator for tracking from Chapter 4. The results are presented separating static from dynamic power consumption, to make it clearer how much the chips contribute at full load versus board losses and static power, where unused peripherals are a significant contributor as well. The measurement is accurate to ±0.5 W. In the case of the presented accelerator, approximately 0.98 W of the static power draw can be attributed to the FPGA itself, according to the Xilinx estimator in Vivado. Testing the power draw with an unprogrammed FPGA showed a decrease in static power of approximately 1-2 W, supporting this estimate. Performance on par with the high-end desktop CPU is measured, but with an order of magnitude less power for the FPGA fabric.

We can see that the FPGA development board is at a similar power level at full load to the Tegra TX1, but with more than a 4x average increase in performance for the presented accelerator design. The estimate of the chip power for the mobile CPU + FPGA fabric on the


Zynq-7045 is 6.5 watts using the Xilinx tools on Vivado post-implementation. Combined with

the results shown in Fig. 5.13, we can estimate the portion of the power requirements that is due to the power supply and unused board peripherals. Static power figures are high for the FPGA board since it carries several unused devices, including SPI flash and a second, unused DDR chip that is only accessible from the programmable logic. On the

Tegra, the GPU was set to run at idle clocks, so that the power should mainly reflect the CPU's consumption, with the added requirements for off-chip DRAM and power supplies on top of the static draw from the GPU resources at idle voltage/frequency levels.

The aim of the accelerator, together with the work in Chapter 4, is to provide a complete

acceleration solution for LSD-SLAM, a state-of-the-art semi-dense SLAM method. The two

designed architectures both achieve real-time performance, evaluated running LSD-SLAM with

a pre-recorded dataset utilizing the two accelerators, with a board power draw at the wall

of approximately 15 W and an estimated chip power consumption of 6.5 W. So far we have

compared the performance of the accelerator to that of the software implementation executing on an embedded and on a desktop-grade CPU. Table 5.2 presents some representative examples of the current state of the art both in SLAM algorithms and in typical embedded solutions, and is a reduced version of the table at the end of Chapter 2.


Work                               Type          Hardware Plat.      Density      Close-loop   Inertial   Typical Power
ORB-SLAM [41]                      SLAM          Laptop CPU          Sparse       X                       38-47 W
LSD-SLAM [17]                      SLAM          Laptop CPU          Semi-dense   X                       40-50 W
Whelan et al. (RGB-D SLAM) [36]    SLAM          GPU Accelerated     Dense        X                       170-250 W
This work                          SLAM          FPGA SoC            Semi-dense   X                       6-7 W
Leutenegger et al. [45]            SLAM          Laptop CPU          Sparse       X            X          30-50 W
SVO [57]                           Odometry      Laptop/Jetson-TX1   Sparse                               30-40 W / 10-15 W
FPGA Acc. of ORB Extraction [61]   Kernel Acc.   FPGA                Sparse                               5.3 W
Navion [60]                        Odometry      ASIC - 65 nm CMOS   Sparse                    X          2-24 mW

Table 5.2: State-of-the-art SLAM examples. Compiled with a focus on features and characteristics of different solutions to demonstrate the breadth of the field. This is a simpler version of the Table in Chapter 2.


The table is not meant to be exhaustive or rank the works. Instead, it was compiled to focus

on the characteristics of different solutions and provide an overview of different software and

hardware approaches to SLAM and their power characteristics.3 The key takeaway is the gap

between fast but sparse odometry with no large-scale capabilities or loop-closure on embedded

systems and accurate, complex and dense solutions occupying different positions on the SLAM

landscape but requiring high-end hardware for real-time operation.

The second column indicates whether the work attempts the whole task of tracking and mapping with a large-scale map, only tracks a trajectory (Odometry), or accelerates a specific kernel or operation meant to operate alongside general-purpose hardware running a SLAM or odometry algorithm. The next column indicates which hardware platform each work targeted: the first three works are purely algorithmic, targeting different map-density outputs; the two following the work presented in this thesis specifically target lower-power platforms; and the last two rows are custom hardware versions. The following columns present the qualitative sparsity level of each work, whether it has large-scale capabilities and the ability to close loops (making a large-scale map coherent), and whether it also takes an inertial sensor as input.

The final column is the typical power of the platform which, unless mentioned in the paper, is estimated based on the hardware used under typical loads, as discussed in the footnote. Our

work, as we discuss in the next section, stands to provide a complete solution to bridge the gap

between the latest research in advanced SLAM algorithms, such as the three examples above

it in the table, and the existing work in embedded SLAM such as the two below. With the

two presented accelerators, the achieved performance-per-watt and latency can bring full semi-

dense direct SLAM capabilities, at a high performance, to the power envelope of an embedded

low-power device.

3The power figures were often not mentioned in works, or measured with varying methods. Thus, in the interest of providing a qualitative view, a typical expected power is included for the chip/platform mentioned in the publications (e.g. nVidia 680GTX, Jetson TX1, Intel i7-4700MQ etc.). For the presented work, the estimated chip power is reported instead of the board power to be in line with other papers.


5.5 Conclusions

This chapter presented an FPGA-based architecture that achieves the performance required to run high-quality, state-of-the-art semi-dense SLAM at high-end desktop performance and the power level of an embedded device. It scales well and is parametrised to address various SLAM specifications and to target different FPGA-SoC devices, demonstrated by successfully running alongside the accelerator presented in Chapter 4. Combined with that work, the accelerator presented here finally closes the loop of SLAM, providing high performance and low latency in a low-power package for both of the interdependent real-time tasks constituting real-time, semi-dense SLAM.

The main findings of this work are that the most efficient designs for the target application combine a high-bandwidth streaming interface to common memory with local caching of the region of interest or, if possible, of the entire processed image frame, tailored to the processing that needs to be performed. To deal with the complex control flow of these algorithms, multi-rate, multi-modal units separated by buffers were found to be a great fit. The most efficient and highest-performance choice was also found to be a pipeline design that follows the dataflow paradigm, aiming to move every data point through the pipeline only once. For a highly efficient design, memory accesses were separated from the rest of the computation, with a stream of data flowing through the hardware carrying its control parameters as metadata along the processing path.

5.5.1 Achievements of Thesis

In the Background and, in more depth, in Chapter 3, we discussed the principles of semi-dense SLAM, why it needs high-performance operation, and what needs to be targeted for acceleration to achieve that in an embedded, power-constrained environment. State-of-the-art semi-dense SLAM combines characteristics that make the performance-per-watt and features of general-purpose hardware insufficient. This research produced two specialised custom hardware architectures that, implemented on an FPGA-SoC, have achieved real-time tracking


and mapping performance at an order-of-magnitude better performance per watt than general

purpose CPUs.

At the beginning of this research, the state of the art in embedded SLAM consisted of sparse,

feature-based algorithms, often with reduced capabilities even compared to state-of-the-art

sparse SLAM such as [41], and usually constrained to simple trajectory estimation as odometry

applications. In contrast, the accelerators presented here, while targeting the same power constraints, have provided a high level of performance with no compromise in the quality of the result for the two real-time components of state-of-the-art semi-dense SLAM. This work targets a complete SLAM algorithm, with loop-closure, high-density reconstruction and large-scale optimisation, and stands to enable significantly richer and more advanced SLAM for embedded

platforms.

This research makes an important step on the way to realising many emerging applications in

robotics and augmented/virtual reality such as those discussed in Chapter 1. It was done with

the goal of offering the performance and power-efficiency necessary for semi-dense SLAM to operate in real time on a platform that can target different projects and devices. Meanwhile,

the lessons that emerged from it can be used to guide future hardware design in this space

towards embedded environment understanding, not limited to reconfigurable hardware but

offering a direction to guide domain-specific ASICs and perhaps even targeted optimisations

for general-purpose embedded hardware to enable a better performance-per-watt.

Moreover, because it was based on the idea of combining specialised hardware tightly inte-

grated with a mobile CPU running a full operating system, it can provide a base platform

to add capabilities to and extend for many different research projects. It can act as a self-

contained architecture, and therefore help close the gap between the FPGA community and

the algorithms/robotics communities, by making it easier to reuse this work even for researchers

without previous experience with reconfigurable platforms.


Chapter 6

Conclusions and Future Work

This thesis has so far explored the diverse field of Simultaneous Localisation and Mapping, dis-

cussed the gap between the capabilities of embedded SLAM and the state of the art in SLAM

algorithm research, and presented two high performance architectures that bridge this gap in

the context of high-performance direct, semi-dense SLAM with power-efficient custom hard-

ware acceleration. This was done in the context of developing platforms, such as autonomous quadcopters, ground robots and augmented-reality devices, with a greater degree of environment awareness, while overcoming the low-power constraints they often come with.

The work presented in this thesis has addressed both parts of the research question, stated in

Section 1.4. On one side, it demonstrated successfully that it is possible to design a custom

hardware architecture to achieve advanced real-time SLAM in power constrained embedded

platforms. On the other, towards the design of such an architecture, a number of novel archi-

tectural and micro-architectural features and components have been presented as part of two

custom architectures that achieve the desired performance and power requirements. So far we

have discussed the main features and requirements for designing such an architecture. In the

following two sections we will discuss the main lessons drawn from this work, how they can be

generalised to apply to similar problems, and the proposed research directions for future work.


6.1 Lessons learnt designing with HLS and FPGA-SoCs

This section will present lessons drawn from the experience of designing custom hardware archi-

tectures for this type of application, as well as from exploring different designs and implementing

them using High-level synthesis and FPGA Systems-on-Chip.

Rapid design exploration and implementation with HLS

While the use of C/C++-to-RTL synthesis tools, such as those used in this work, can significantly shorten implementation time, giving the benefits we discussed in Chapters 1 and 2, it comes with two significant constraints.

1. Scheduling of operations, pipeline stages and processing units are automatically

inferred. This means that this part of the design is partially out of the hands of the designer.

Hence, in the cases where this leads to a series of inefficient design choices, it is not straightforward to amend. Organising HLS code in short loops and separating them by interfaces that

the tool will not automatically flatten can partially overcome the problems this causes. A way

to achieve this is using streaming FIFOs or hierarchy levels with flattening explicitly turned

off.

This creates a modular coding style, and leads to the generation of RTL that is easier to verify.

At the same time, the tool has to work with a smaller scheduling and state space, improving

the potential of identifying an efficient solution. These factors make it more likely to identify

problematic C/C++ that infers less than optimal hardware and easier to refactor that code to

end up with the desired RTL. In this context, dataflow architectures proved very useful for

efficiently combining this design principle with data-intensive algorithms such as SLAM.

2. Tools assume a one-cycle latency for any interface access, even if the interface is connected to a busy interconnect and, through it, to an off-chip DRAM. This means that the “tool reports”, which estimate the latency and performance of the RTL, may miscalculate by one to two orders of magnitude depending on the underlying architecture, and that the tools may generate highly inefficient memory


interfaces.

This can be overcome by using dedicated for-loops in HLS that perform sufficiently large se-

quences of serial accesses from memory to local buffers, independently of the rest of the RTL.

These can be utilized to generate more efficient types of accesses to a memory interface, such as

burst transactions of a desired length and width. Such a loop should be packaged in a dedicated

I/O unit and be paired to a sufficiently large buffer to place the elements that are read or ready

to be written out. Anticipating the algorithm’s data needs is essential to efficiently use such

a memory interface. Moreover, knowledge of the interconnect and I/O of the chip should be

taken into consideration to optimise memory accesses for a specific SoC, tuning word size and

burst size for optimal results.

Debug and verification of HLS-generated RTL

HLS tools have a big advantage in that the C/C++ code can be verified before it goes through a “Synthesis” step to be converted to Verilog or VHDL. It can be compiled with a software compiler, including debug symbols, and run on the development computer in a debugger, with the advantages of state visibility, breakpoints, etc. However, in this author's experience, there are

significant challenges in verifying the underlying HLS-generated RTL.

While there was an RTL simulation included in the tool, it would not work for any non-trivial

design, instead crashing or freezing. In both cases, early cancellation or any other error would

result in no output of waveforms or other information, meaning there was no way of knowing what the cause was. Meanwhile, valid HLS/C code, with behaviour identical to the software

when simulated as software, would often successfully synthesize to RTL that would then freeze

or corrupt its output when run on the FPGA with real data. Moreover, as the accelerators were

designed to operate in the context of an SoC with a full operating system and hardware/software

co-processing, the RTL under development could only be tested extensively when running in

that context, making bare-metal development tools incompatible.

While the above experience can be attributed partially to the immaturity of the tools, it is


not expected to change significantly in the near future and therefore makes it worth discussing

alternative solutions to support current efforts. A useful practice to facilitate the debugging

of what essentially becomes a black-box, because of the above issues, is the following: during

the development of a design, one should include extra “debug” ports in the RTL to enable

in-system debug, an idea also used to debug ASICs in silicon. A number of slave-type ports

can be used to expose the current value of key internal registers, to be read periodically or when the hardware freezes. In addition, a master-type port can be connected from the RTL to the

off-chip memory and export internal state and the entire content of caches when triggered, or

in pre-determined intervals.

Using the two types of ports described above, along with careful selection of the state to export,

can assist towards the identification of potential issues not visible when run under a debugger,

or with limited testbenches. A useful side-effect is that the same ports and debug tools can be

used while running with real-world data in lock-step with a software solution, in order to check

the numerical accuracy of the hardware more extensively at different stages of the algorithm.

Good development practices

Finally, some other good practices that will assist in future work using HLS and FPGA-SoCs

are included here. The ideal scenario of cooperation should include a system interrupt that can be linked to the RTL. This way, the CPU can keep the maximum number of cycles free for other useful computation instead of acting as a controller or polling the hardware. Due

to significant latency in the various forms of communication between CPU and FPGA, tasks

that are under a few thousand CPU cycles and need HW/SW synchronisation are not good

candidates for acceleration, unless batch processing is possible (i.e. performing multiple tasks

in parallel). Memory coherency can become a significant issue for HW/SW cooperation under

an operating system and needs attention in such architectures.
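The observation about synchronisation overhead and batching can be made concrete with a simple back-of-the-envelope model. All the numbers in the sketch below are hypothetical:

```cpp
#include <cassert>

// Simple model (all numbers hypothetical): offloading a task costs a fixed
// synchronisation overhead in CPU cycles, so the hardware only wins once the
// batched work amortises that overhead.
//
// cpu_cycles:    cycles the task takes in software
// speedup:       how much faster the accelerator runs the task body
// sync_overhead: round-trip HW/SW communication cost in cycles
// batch:         number of tasks sent per synchronisation
// Returns true if offloading the batch is faster than running it on the CPU.
bool offload_wins(long cpu_cycles, double speedup,
                  long sync_overhead, long batch) {
    double sw_time = static_cast<double>(cpu_cycles) * batch;
    double hw_time = sync_overhead + (cpu_cycles * batch) / speedup;
    return hw_time < sw_time;
}
```

With a 1000-cycle task, a 10x accelerator and a 5000-cycle round trip, a single task loses to software, while a batch of 64 comfortably amortises the overhead; this is the reason small, frequently synchronised tasks are poor acceleration candidates.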


6.2 Generalisation of the presented research

Main application parameters

As discussed in section 2.4 and throughout this thesis, the performance of SLAM algorithms

is affected by key parameters, where it is possible to increase performance at the expense

of quality, or vice-versa. In the context of the algorithms we have discussed here the main

parameters are tracking and mapping resolution, point selection thresholds (such as maximum

gradient, validity score), and the tuning of noise and damping parameters in the optimisation

process of tracking and regularisation parameters used in the mapping function. While different

algorithms might have slightly different parameters to tune, computer vision and SLAM are

probabilistic algorithms dealing with noisy and imprecise data and attempting to recreate

a finite model of a world with infinite detail. As such, there will always be a trade-off between computation, density and quality on the one hand, and performance on the other, when optimising an algorithm for a specific application and implementation.
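The effect of one such parameter, a minimum-gradient threshold for point selection as used in semi-dense methods, can be sketched in a few lines. The helper name and threshold values below are hypothetical:

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Sketch of gradient-based point selection (threshold values hypothetical):
// a pixel is kept as a candidate point only if its horizontal intensity
// gradient exceeds a minimum-gradient threshold. Raising the threshold
// trades map density (and quality) for computation.
std::vector<int> select_points(const std::vector<int>& row, int min_gradient) {
    std::vector<int> selected;                 // indices of selected pixels
    for (size_t x = 1; x + 1 < row.size(); ++x) {
        int grad = std::abs(row[x + 1] - row[x - 1]) / 2;  // central difference
        if (grad >= min_gradient) selected.push_back(static_cast<int>(x));
    }
    return selected;
}
```

A stricter threshold keeps only the sharpest edge pixels, directly reducing the number of Keypoints that tracking and mapping must process.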

The tuning of these parameters, while crucial in SLAM, is orthogonal to the work presented

in this thesis. The presented architectures are efficient for a large range of resolutions and

algorithm parameters, meaning that any improvement due to parameter tuning will be valid

for both the software and the hardware version of the functions involved. As such, the best

solutions will come from combining this research with optimising the algorithmic parameters

for specific environments and applications.

Scaling custom architectures with different algorithmic parameters

The above parameters will not affect efficiency but will affect performance indirectly in two

ways. Firstly, most of them will influence the total number of Keypoints to process, either

through resolution changes or selection criteria. Secondly, better accuracy and quality will

often lead to faster convergence, meaning fewer optimisation steps and hence less computation

for tracking. In practice, this will not offset the upfront increase in computational load to

obtain that level of accuracy.


However, by carefully balancing the specific requirements of an application, that tuning can

be combined with a reduction of the mapping or tracking resolution used. This might not

degrade accuracy and quality to unacceptable levels for that application, but will require less memory for caching and a correspondingly smaller amount of logic to achieve similar performance levels for both the tracking and the mapping task.

Bottlenecks to arbitrarily scaling this architecture on an FPGA-SoC

Scaling to a significantly larger FPGA would provide the architectures presented here with more processing units with which to accelerate all parallelisable computation tasks, but it

would face two main bottlenecks. The first has to do with the caches used in this architecture,

while the second is related to off-chip bandwidth.

The first problem is that both tracking and mapping access random points on a frame cache.

After sufficiently improving the computational performance, the algorithm will need to access

multiple random windows in these caches in a single cycle. In a straightforward implementation

that would require scaling to multi-port RAMs, or even duplication of caches. By

taking advantage of the application requirements, for example, that these caches are written

once and read multiple times, other solutions with multi-level caches and more complicated

control for sharing the same resource could be leveraged. While there are many potential

solutions to consider that will eventually push that bottleneck further, the fact is that the cache designs we have so far considered will need to be replaced with more complex caches of larger area and power consumption to provide sufficient bandwidth.

The second is a problem currently shared by many high-performance hardware architectures, such as high-end general-purpose CPUs and GPUs: it is easier to add more compute in a

centralised location than feed it with the data it needs from another location on- and off-chip.

As the processing capabilities of the accelerators we discussed scale and caches get larger to

fit larger amounts of data, the resulting hardware on chip demands higher bandwidth from

off-chip sources. However, the latency and bandwidth of memory interfaces to off-chip DRAM

and the DRAM chips themselves have improved more slowly than the available compute on-chip, a


trend that could continue in the future. As such, off-chip bandwidth may become a significant

bottleneck and have to be addressed by alternative solutions in the future.

On designing future systems

Drawing on the lessons learnt during the past few years’ research on this topic, the main focus

should be firstly on memory bandwidth, both on- and off-chip. As we discussed in the previous

paragraphs, memory bandwidth and cache design will become a crucial factor earlier in scaling

up. Considering the experience of designing and implementing the presented architectures, a

well-designed interconnect on an FPGA-SoC should be established first, as the logic can then

be utilised to its maximum potential, while the reverse will mean that in applications such as SLAM there will always be underutilised logic.

Moreover, an ideal architecture would share caches between accelerators and general-purpose compute with some coherency protocol implemented in hardware. This would provide

an order-of-magnitude improvement in tasks such as synchronisation and co-processing latency,

as well as raw bandwidth for accessing key shared data structures. It would also allow compute

capabilities and therefore the frame-rate, resolution and density characteristics, to scale up

better compared to current system architectures in off-the-shelf FPGA-SoCs.

Deploying these architectures on devices with higher resource constraints

The presented research has been implemented and tested targeting a power consumption of 5-10 W. This was done for two reasons. Firstly, most other mobile platforms that have attempted advanced SLAM had a similar or higher power consumption and, as has been demonstrated,

significantly less performance. Secondly, it was imposed by the development boards that were

available to conduct experiments. As it stands, the FPGA-SoCs we prototyped on could directly

be used with minimal modification on a real robotic system.

However, with its weight and power consumption, the system tested in our experiments would

be limited to larger ground robots. A smaller off-the-shelf board with a reduction in


performance or resolution could be used on larger, more powerful quadcopters. However, a

significant improvement could be obtained by integrating the FPGA-SoC into the system board

of the robot. This would avoid the power losses and the large increase in weight brought by the

use of development boards. At that point it would already be an efficient solution for large

robots or vehicles, aiming to accelerate or offload state-of-the-art SLAM capabilities.

Nevertheless, the lessons learnt here and the general architecture can both be transferred to the

development of more power constrained systems. Firstly, the principles we have presented for

streaming operations, dataflow processing, deep pipelines and the importance of caching hold

for any type of SLAM system, especially as the resolution and density increase. For semi-dense

SLAM, our architecture could fit in a smaller, more power-efficient FPGA, directly integrated

with a state-of-the-art CPU on a custom SoC using the latest semiconductor process technology.

Such a custom SoC, specialised for the application, would achieve real-time performance while

improving power consumption significantly compared to off-the-shelf FPGA-SoCs that include

many unnecessary components on-chip. Further reducing parameters such as resolution and density would allow the use of smaller logic and a more power-efficient CPU, permitting deployment to most battery-powered devices. Meanwhile, re-implementing

part of the architecture in a lower-level HDL such as Verilog and taking advantage of vendor-

specific features present in state-of-the-art FPGAs can improve the achievable frequency of the

presented architectures after place and route, providing a further performance improvement.

Finally, for applications in micro-aerial vehicles or augmented reality, where the weight and

power of an FPGA could still prove too high, the principles and architectures presented here

could guide the design of a completely custom ASIC. Such a specialised chip could implement

a heterogeneous platform in the form of dedicated accelerator hardware tightly integrated next

to, or even as part of, a general-purpose CPU with dedicated I/O for the application. This kind

of solution would provide at least an order of magnitude improvement in power and area for

the same performance compared to an FPGA fabric at the cost of significantly higher design

effort.


Upcoming advances in hardware that stand to significantly benefit this architecture

We have already discussed in this section the most important changes and advances towards

significantly improving what is achievable with this type of architecture, both in the context

of FPGA-SoCs as well as custom chips. Since a lot of the focus is on memory bandwidth

and latency, a concern shared with most high-performance digital hardware currently developed, solutions from that space, if ported to FPGA-SoCs, will provide a significant improvement

and allow an improved implementation of the design and research directions discussed in this

chapter.

The most important of these anticipated developments are 3D stacking of logic and High Bandwidth Memory (3D-stacked SDRAM chips with much wider, lower-latency and lower-energy communication buses). These stand to significantly improve memory bandwidth, as well as latency and

bandwidth in communication between different modules in a heterogeneous system. This type

of technology fits well with the application of the principles discussed in this chapter towards

higher performance and efficiency.

6.3 Research Conclusions

The first conclusion drawn from the work presented in this thesis is that the field of hardware

design is entering an era where hardware and software co-design is becoming more and more

crucial to achieve highly efficient and performant designs. One of the main bottlenecks in the

work presented in Chapter 3 comes from attempting to accelerate a task that was developed

using software engineering principles disconnected from the underlying hardware. Redesigning

parts on the software level, in the context of both general purpose hardware, and especially with

the existence of a custom hardware accelerator operating in the same memory space, can sig-

nificantly improve the efficiency of many tasks without changing the principles or functionality

of the underlying SLAM algorithm.

Focusing on SLAM, a main conclusion is that in this family of algorithms, data movement


is a crucial aspect of the achievable performance of a custom hardware accelerator and its

power efficiency. SLAM is data intensive and, at this point in time for digital hardware,

data movement off-chip is one of the most power-consuming and high-latency operations. In

this context, SLAM ideally requires a combination of specially-designed caches or local

memories storing information accessed in a random manner and dedicated high-bandwidth

streaming direct memory access for the rest of communication.

Since locality is weak and the access patterns are not predictable for the random accesses, traditional cache design becomes less efficient. Instead, recognising that this randomly accessed

information is only pixel-intensity information and thus has a smaller size compared to the

rest of the processed data that are accessed in a more predictable manner, a better solution in

terms of performance and efficiency was found to be designing partitioned, multi-port memories specifically for this information, supporting single-cycle access to square windows of neighbouring pixels, which are a very common access pattern for both tracking and mapping in

SLAM. This includes on-the-fly gradient calculation, finding minimum, maximum and absolute values in a small area, and interpolating pixel intensities around a point in the camera’s frame.

Finally, recognising that the rest of the information is usually accessed on a row-by-row basis, dedicated I/O units with buffered burst transactions become a natural choice for efficient

high-performance designs.
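As a behavioural sketch of the partitioned frame-cache idea, pixels can be interleaved across banks so that a square window touches each bank at most once. The sizes and names below are hypothetical, chosen only to make the banking scheme concrete:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch (hypothetical sizing): pixels are interleaved across BANKS x BANKS
// single-port memories by (x mod BANKS, y mod BANKS). Any BANKS x BANKS
// window then touches each bank exactly once, which is what lets the
// hardware version fetch a whole window in a single cycle.
constexpr int BANKS = 2;        // supports single-cycle 2x2 windows
constexpr int WIDTH = 8, HEIGHT = 8;

struct BankedFrameCache {
    // One memory per bank; each holds every BANKS-th pixel in x and y.
    std::vector<uint8_t> bank[BANKS][BANKS];

    BankedFrameCache() {
        for (int by = 0; by < BANKS; ++by)
            for (int bx = 0; bx < BANKS; ++bx)
                bank[by][bx].resize((WIDTH / BANKS) * (HEIGHT / BANKS));
    }
    void write(int x, int y, uint8_t v) {
        bank[y % BANKS][x % BANKS][(y / BANKS) * (WIDTH / BANKS) + x / BANKS] = v;
    }
    uint8_t read(int x, int y) const {
        return bank[y % BANKS][x % BANKS][(y / BANKS) * (WIDTH / BANKS) + x / BANKS];
    }
    // A 2x2 window read: in hardware all four accesses hit distinct banks
    // and therefore complete in the same cycle.
    void window2x2(int x, int y, uint8_t out[2][2]) const {
        for (int dy = 0; dy < 2; ++dy)
            for (int dx = 0; dx < 2; ++dx)
                out[dy][dx] = read(x + dx, y + dy);
    }
};
```

In software the four reads are sequential; the point of the partitioning is that in hardware they map to four distinct physical memories and so never conflict.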

This is also important for the next main conclusion of this work. Because of the combination of

a high number of operations per “Keypoint” in SLAM with non-trivial control flow and variable latency for some tasks, it turns out that an architecture following the dataflow paradigm,

as described here, produces a very efficient hardware design. Separating computation into well-optimised, self-contained hardware blocks, connected to each other with streaming buffers, each with a clear input and output and its own internal control, results in a better fit for SLAM.

Building on this, one can design multi-modal, multi-rate and variable-latency hardware units, utilising the buffered communication to absorb delays and thus matching the variable-latency tasks inherent in semi-dense SLAM. This can offer high performance together with high utilisation and therefore power efficiency. Instead of treating the application’s characteristics as challenges, in this case they can be embraced as opportunities to improve the flexibility and


efficiency of the custom hardware generated.
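A software model of this dataflow decomposition can illustrate the structure, with std::queue standing in for the hardware streams. The stage contents below are entirely hypothetical; only the shape (self-contained stages, one input, one output, buffered hand-off) reflects the architecture:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Sketch of the dataflow decomposition (stage contents hypothetical):
// self-contained stages communicate only through streaming buffers, so a
// variable-latency stage simply drains its input FIFO at its own rate while
// upstream keeps producing. Each stage has one input, one output and its
// own control.
using Stream = std::queue<int>;

void produce_keypoints(const std::vector<int>& frame, Stream& out) {
    for (int px : frame)
        if (px > 10) out.push(px);        // crude "point selection" threshold
}

void refine(Stream& in, Stream& out) {    // variable work per element
    while (!in.empty()) {
        int v = in.front(); in.pop();
        while (v % 2 == 0) v /= 2;        // data-dependent iteration count
        out.push(v);
    }
}
```

The second stage takes a different number of iterations per element; in hardware the intervening FIFO absorbs exactly this kind of variability so neither stage has to stall the other.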

Last but not least, a design utilising the dataflow paradigm and smaller well-contained hardware

blocks, instead of a monolithic pipeline, offers another advantage. Especially in the light of

high-level synthesis languages, but not limited to those, such a design allows the designer to

better optimise each unit separately and improves the design process, both on the conceptual

and the tool level. Moreover, high-level synthesis enables faster and easier design-space exploration, by varying the compute unit allocation for each hardware

block and utilising the tools to search for the best scheduling of the necessary operations on

the allocated hardware units.

6.4 Future Work

Lastly, in this section we will discuss ideas, opportunities and research directions for the archi-

tectures presented in this thesis, the field of embedded high-performance low-power SLAM in

general and how the conclusions we discuss here can be applied in the future.

In the short term, there are certain directions or enhancements that have not been explored yet

in the context of this research, but can provide different benefits to the presented architectures.

Firstly, scalability as discussed above was explored in the context of compute units, which for

the targeted family of devices (Xilinx Zynq) was sufficient to address different off-the-shelf SoCs

and resource budgets. However, especially in the larger context of different FPGA-SoC devices

or ASICs, a separate investigation can take place for different designs to scale the off-chip

memory interfaces, the communication buffers between hardware blocks and the on-chip frame

caches. This will introduce more variables into the design-space exploration, which should enhance the benefits of the scalability techniques we have already proposed.

In the realm of custom hardware, the performance and resource advantages of using reduced

precision and custom data types are well researched. Meanwhile, SLAM algorithms deal with

imprecise data and probabilistic estimates for the environment. Therefore, building on top of

the architectures presented in this thesis, an improvement can be realised by using a mixture of


normal and reduced precision compute units and custom numerical representations. By taking

advantage of domain-specific knowledge, a carefully designed implementation of the above can

further enhance the performance and efficiency of the proposed architectures, while limiting the

drop in accuracy or quality by tailoring them to the sensors and computation used in SLAM.
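As an illustration of such a custom numerical representation, a 16-bit fixed-point type can be modelled as below. The choice of a Q6.10 format here is hypothetical, not a format used in the presented architectures:

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

// Sketch of a reduced-precision custom data type: a Q6.10 fixed-point value
// (hypothetical format choice) stored in 16 bits. In custom hardware this
// halves, or better, the width of datapaths and memories relative to
// single-precision float, at the cost of bounded quantisation error.
constexpr int FRAC_BITS = 10;

struct Q6_10 {
    int16_t raw;
    static Q6_10 from_float(float f) {
        return {static_cast<int16_t>(std::lround(f * (1 << FRAC_BITS)))};
    }
    float to_float() const { return raw / float(1 << FRAC_BITS); }
    Q6_10 operator*(Q6_10 o) const {   // widen, multiply, renormalise
        int32_t p = int32_t(raw) * int32_t(o.raw);
        return {static_cast<int16_t>(p >> FRAC_BITS)};
    }
};
```

The quantisation error of a single value is bounded by half a unit in the last place, here 2^-11; for the noisy, probabilistic quantities in SLAM this is often well below the sensor noise floor, which is what makes the trade attractive.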

In the longer term, there are two promising research directions in the view of this author. The

first has to do with using the presented architectures and their derivatives as a base, to explore

the benefits of having state-of-the-art SLAM functionality self-contained on-board an embedded

platform. By combining existing lightweight robotic platforms, such as quadcopters, with the

proposed architecture realised on an FPGA-SoC, it becomes possible to enable autonomous

environment exploration in much smaller and more agile robots than before. Progress on this

front can feed back to developments both in SLAM, and in realising more advanced applications

of this type, opening up exciting research avenues in this direction.

The second direction has to do with the evolution of SLAM to a more complete version of

environment awareness. New research directions combining SLAM with advances in machine

learning and with ideas for multiple levels of map representation stand to provide leaps of

improvement in the quality and level of environment understanding. For example, feeding the depth map of a SLAM algorithm, together with the camera pixels, into a deep convolutional neural network can produce a prediction of object boundaries and labels for different

objects in the scene. This could then potentially be fed back into SLAM to further refine the

generated depth map and detect outliers with greater precision.

By processing the highly demanding tasks of tracking and mapping on dedicated hardware

with the architectures described here, the general purpose CPU and potentially GPU of an

embedded system are free to be used for other levels of awareness such as scene labelling and

obstacle avoidance. Moreover, advances in reconfigurable hardware can be utilised to design

dedicated hardware for the tasks of machine learning as well, as they become more developed in

the research community, and combine these levels of understanding on a single custom hardware

unit. In this way, by breaking down the boundaries between these different algorithms, large

improvements in efficiency and performance can be gained with research into how to combine


the different knowledge representations and how best to take advantage of data locality while

image information streams through the same system-on-chip.

Page 194: Custom hardware architectures for embedded high-performance and low-power SLAM · 2020. 2. 5. · high-performance and low-power SLAM Konstantinos Boikos ... Voula and Kalliopi for

Bibliography

[1] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocular camera,”

in Proceedings of the IEEE International Conference on Computer Vision, pp. 1449–1456,

2013.

[2] Xilinx, “Zynq-7000 Technical Reference Manual (User Guides).” https://www.xilinx.

com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf, 2018. [Online;

accessed Feb-2019].

[3] J. Sturm, K. Konolige, C. Stachniss, and W. Burgard, “Vision-based detection for learning

articulation models of cabinet doors and drawers in household environments,” in Robotics

and Automation (ICRA), 2010 IEEE International Conference on, pp. 362–368, IEEE,

2010.

[4] B. Krose, R. Bunschoten, S. Hagen, B. Terwijn, and N. Vlassis, “Household robots look

and learn: environment modeling and localization from an omnidirectional vision system,”

IEEE Robotics & Automation Magazine, vol. 11, no. 4, pp. 45–52, 2004.

[5] M. Faessler, F. Fontana, C. Forster, E. Mueggler, M. Pizzoli, and D. Scaramuzza, “Au-

tonomous, vision-based flight and live dense 3d mapping with a quadrotor micro aerial

vehicle,” Journal of Field Robotics, vol. 33, no. 4, pp. 431–450, 2016.

[6] A. J. Barry and R. Tedrake, “Pushbroom stereo for high-speed navigation in cluttered

environments,” in Robotics and Automation (ICRA), 2015 IEEE International Conference

on, pp. 3046–3052, IEEE, 2015.

172

Page 195: Custom hardware architectures for embedded high-performance and low-power SLAM · 2020. 2. 5. · high-performance and low-power SLAM Konstantinos Boikos ... Voula and Kalliopi for

BIBLIOGRAPHY 173

[7] M. Saska, J. Chudoba, L. Precil, J. Thomas, G. Loianno, A. Tresnak, V. Vonasek, and

V. Kumar, “Autonomous deployment of swarms of micro-aerial vehicles in cooperative

surveillance,” in Unmanned Aircraft Systems (ICUAS), 2014 International Conference on,

pp. 584–595, IEEE, 2014.

[8] S. Waharte and N. Trigoni, “Supporting search and rescue operations with uavs,” in Emerg-

ing Security Technologies (EST), 2010 International Conference on, pp. 142–147, IEEE,

2010.

[9] C. Zhang and J. M. Kovacs, “The application of small unmanned aerial systems for pre-

cision agriculture: a review,” Precision agriculture, vol. 13, no. 6, pp. 693–712, 2012.

[10] J. Janai, F. Guney, A. Behl, and A. Geiger, “Computer Vision for Autonomous Vehicles:

Problems, Datasets and State-of-the-Art,” arXiv e-prints, Apr. 2017.

[11] S. Lee, S. Lee, and J. J. Yoon, “Illumination-invariant localization based on upward looking

scenes for low-cost indoor robots,” Advanced Robotics, vol. 26, no. 13, pp. 1443–1469, 2012.

[12] B. Vincke, A. Elouardi, and A. Lambert, “Real time simultaneous localization and map-

ping: towards low-cost multiprocessor embedded systems,” EURASIP Journal on Embed-

ded Systems, vol. 2012, no. 1, pp. 1–14, 2012.

[13] J. Sturm, E. Bylow, C. Kerl, F. Kahl, and D. Cremer, “Dense tracking and mapping with

a quadrocopter,” Unmanned Aerial Vehicle in Geomatics (UAV-g), Rostock, Germany,

2013.

[14] M. Blosch, S. Weiss, D. Scaramuzza, and R. Siegwart, “Vision based mav navigation

in unknown and unstructured environments,” in Robotics and automation (ICRA), 2010

IEEE international conference on, pp. 21–28, IEEE, 2010.

[15] H. Wong, V. Betz, and J. Rose, “Comparing fpga vs. custom cmos and the impact on

processor microarchitecture,” in Proceedings of the 19th ACM/SIGDA international sym-

posium on Field programmable gate arrays, pp. 5–14, ACM, 2011.

Page 196: Custom hardware architectures for embedded high-performance and low-power SLAM · 2020. 2. 5. · high-performance and low-power SLAM Konstantinos Boikos ... Voula and Kalliopi for

174 BIBLIOGRAPHY

[16] I. Kuon and J. Rose, “Measuring the gap between fpgas and asics,” IEEE Transactions on

computer-aided design of integrated circuits and systems, vol. 26, no. 2, pp. 203–215, 2007.

[17] J. Engel, T. Schops, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,”

in European Conference on Computer Vision (ECCV), September 2014.

[18] K. Boikos and C.-S. Bouganis, “Semi-dense SLAM on an FPGA SoC,” in Field Pro-

grammable Logic and Applications (FPL), 2016 26th International Conference on, pp. 1–4,

IEEE, 2016.

[19] K. Boikos and C.-S. Bouganis, “A high-performance system-on-chip architecture for direct

tracking for SLAM,” in Field Programmable Logic and Applications (FPL), 2017 27th

International Conference on, pp. 1–7, IEEE, 2017.

[20] K. Boikos and C.-S. Bouganis, “A Scalable FPGA-Based Architecture for Depth Estima-

tion in SLAM,” in Applied Reconfigurable Computing (C. Hochberger, B. Nelson, A. Koch,

R. Woods, and P. Diniz, eds.), (Cham), pp. 181–196, Springer International Publishing,

2019.

[21] S. Finsterwalder, Eine Grundaufgabe der Photogrammetrie und ihre Anwendung auf Bal-

lonenaufnahmen, von S. Finsterwalder... Die k. Akademie, 1904.

[22] G. Gallego, E. Mueggler, and P. F. Sturm, “Translation of ”zur ermittlung eines ob-

jektes aus zwei perspektiven mit innerer orientierung” by erwin kruppa (1913),” CoRR,

vol. abs/1801.01454, 2018.

[23] P. Sturm, “A historical survey of geometric computer vision,” in Computer Analysis of

Images and Patterns, pp. 1–8, Springer, 2011.

[24] E. Kruppa, Zur Ermittlung eines Objektes aus zwei Perspektiven mit innerer Orientierung.

Holder, 1913.

[25] C. G. Harris, M. Stephens, et al., “A combined corner and edge detector.,” in Alvey vision

conference, no. 50, 1988.

Page 197: Custom hardware architectures for embedded high-performance and low-power SLAM · 2020. 2. 5. · high-performance and low-power SLAM Konstantinos Boikos ... Voula and Kalliopi for

BIBLIOGRAPHY 175

[26] M. Trajkovic and M. Hedley, “Fast corner detection,” Image and vision computing, vol. 16,

no. 2, pp. 75–87, 1998.

[27] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International

journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.

[28] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in European

conference on computer vision, pp. 404–417, Springer, 2006.

[29] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to

SIFT or SURF,” in Computer Vision (ICCV), 2011 IEEE international conference on,

pp. 2564–2571, IEEE, 2011.

[30] S. Leutenegger, M. Chli, and R. Y. Siegwart, BRISK: Binary robust invariant scalable

keypoints. IEEE, 2011.

[31] H. Jin, P. Favaro, and S. Soatto, “Real-time 3d motion and structure of point features: a

front-end system for vision-based control and interaction,” in Proceedings IEEE Conference

on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), vol. 2,

pp. 778–779, IEEE, 2000.

[32] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single

camera SLAM,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6,

pp. 1052–1067, 2007.

[33] G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in

Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Aug-

mented Reality, pp. 1–10, IEEE Computer Society, 2007.

[34] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendon-Mancha, “Visual simultaneous

localization and mapping: a survey,” Artificial Intelligence Review, vol. 43, no. 1, pp. 55–

81, 2015.

Page 198: Custom hardware architectures for embedded high-performance and low-power SLAM · 2020. 2. 5. · high-performance and low-power SLAM Konstantinos Boikos ... Voula and Kalliopi for

176 BIBLIOGRAPHY

[35] Z. Lu, Z. Hu, and K. Uchimura, “SLAM estimation in dynamic outdoor environments: A

review,” in International Conference on Intelligent Robotics and Applications, pp. 255–267,

Springer, 2009.

[36] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald, “Real-

time large-scale dense rgb-d slam with volumetric fusion,” The International Journal of

Robotics Research, vol. 34, no. 4-5, pp. 598–626, 2015.

[37] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, “Elastic-

fusion: Real-time dense slam and light source estimation,” The International Journal of

Robotics Research, vol. 35, no. 14, pp. 1697–1716, 2016.

[38] J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, “Fusion++: Volu-

metric object-level slam,” in 2018 International Conference on 3D Vision (3DV), pp. 32–

41, IEEE, 2018.

[39] A. Handa, R. A. Newcombe, A. Angeli, and A. J. Davison, “Real-time camera tracking:

When is high frame-rate best?,” in European Conference on Computer Vision, pp. 222–235,

Springer, 2012.

[40] J. Engel, V. Usenko, and D. Cremers, “A photometrically calibrated benchmark for monoc-

ular visual odometry,” in arXiv:1607.02555, July 2016.

[41] R. Mur-Artal, J. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular

slam system,” Robotics, IEEE Transactions on, vol. 31, no. 5, pp. 1147–1163, 2015.

[42] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense tracking and map-

ping in real-time,” in Computer Vision (ICCV), 2011 IEEE International Conference on,

pp. 2320–2327, IEEE, 2011.

[43] J. Stuhmer, S. Gumhold, and D. Cremers, “Real-time dense geometry from a handheld

camera,” in Joint Pattern Recognition Symposium, pp. 11–20, Springer, 2010.

[44] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on

pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2018.

Page 199: Custom hardware architectures for embedded high-performance and low-power SLAM · 2020. 2. 5. · high-performance and low-power SLAM Konstantinos Boikos ... Voula and Kalliopi for

BIBLIOGRAPHY 177

[45] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.

[46] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1524–1531, IEEE, 2014.

[47] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[48] J. J. Moré, “The Levenberg-Marquardt algorithm: Implementation and theory,” in Numerical Analysis, pp. 105–116, Springer, 1978.

[49] G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in IEEE International Conference on Robotics and Automation, 2011.

[50] P. Greisen, S. Heinzle, M. Gross, and A. P. Burg, “An FPGA-based processing pipeline for high-definition stereo video,” EURASIP Journal on Image and Video Processing, vol. 2011, no. 1, p. 18, 2011.

[51] A. Cornu, S. Derrien, and D. Lavenier, “HLS tools for FPGA: Faster development with better performance,” in International Symposium on Applied Reconfigurable Computing, pp. 67–78, Springer, 2011.

[52] B. Vincke, A. Elouardi, A. Lambert, and A. Merigot, “Efficient implementation of EKF-SLAM on a multi-core embedded system,” in IECON 2012 – 38th Annual Conference of the IEEE Industrial Electronics Society, pp. 3049–3054, IEEE, 2012.

[53] S. Lee and S. Lee, “Embedded visual SLAM: Applications for low-cost consumer robots,” IEEE Robotics & Automation Magazine, vol. 20, no. 4, pp. 83–95, 2013.

[54] R. Voigt, J. Nikolic, C. Hürzeler, S. Weiss, L. Kneip, and R. Siegwart, “Robust embedded egomotion estimation,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2694–2699, IEEE, 2011.


[55] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, “Real-time onboard visual-inertial state estimation and self-calibration of MAVs in unknown environments,” in 2012 IEEE International Conference on Robotics and Automation (ICRA), pp. 957–964, IEEE, 2012.

[56] M. Sanfourche, V. Vittori, and G. Le Besnerais, “eVO: A realtime embedded stereo odometry for MAV applications,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2107–2114, IEEE, 2013.

[57] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22, IEEE, 2014.

[58] J. Engel, J. Sturm, and D. Cremers, “Scale-aware navigation of a low-cost quadrocopter with a monocular camera,” Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[59] T. Schöps, J. Engel, and D. Cremers, “Semi-dense visual odometry for AR on a smartphone,” in International Symposium on Mixed and Augmented Reality, September 2014.

[60] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze, “Navion: A fully integrated energy-efficient visual-inertial odometry accelerator for autonomous navigation of nano drones,” in IEEE Symposium on VLSI Circuits, 2018.

[61] J. Weberruss, L. Kleeman, D. Boland, and T. Drummond, “FPGA acceleration of multilevel ORB feature extraction for computer vision,” in 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–8, IEEE, 2017.

[62] J. Nikolic, J. Rehder, M. Burri, P. Gohl, S. Leutenegger, P. T. Furgale, and R. Siegwart, “A synchronized visual-inertial sensor system with FPGA pre-processing for accurate real-time SLAM,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 431–437, IEEE, 2014.

[63] D. Bouris, A. Nikitakis, and I. Papaefstathiou, “Fast and efficient FPGA-based feature detection employing the SURF algorithm,” in 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 3–10, IEEE, 2010.

[64] L. Yao, H. Feng, Y. Zhu, Z. Jiang, D. Zhao, and W. Feng, “An architecture of optimised SIFT feature detection for an FPGA implementation of an image matcher,” in 2009 International Conference on Field-Programmable Technology (FPT), pp. 30–37, IEEE, 2009.

[65] C. Banz, S. Hesselbarth, H. Flatt, H. Blume, and P. Pirsch, “Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA implementation,” in 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, pp. 93–101, IEEE, 2010.

[66] T. M. Howard, A. Morfopoulos, J. Morrison, Y. Kuwata, C. Villalpando, L. Matthies, and M. McHenry, “Enabling continuous planetary rover navigation through FPGA stereo and visual odometry,” in 2012 IEEE Aerospace Conference, pp. 1–9, IEEE, 2012.

[67] D. Honegger, H. Oleynikova, and M. Pollefeys, “Real-time and low latency embedded computer vision hardware based on a combination of FPGA and mobile CPU,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), pp. 4930–4935, IEEE, 2014.

[68] J. Sturm, W. Burgard, and D. Cremers, “Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark,” in Proc. of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robot Systems (IROS), 2012.

[69] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.

[70] Intel, “Intel Core i7-4700MQ Processor product specification.” https://ark.intel.com/content/www/us/en/ark/products/75117/intel-core-i7-4700mq-processor-6m-cache-up-to-3-40-ghz.html, 2013. [Online; accessed Mar-2019].


[71] A. J. Barry, H. Oleynikova, D. Honegger, M. Pollefeys, and R. Tedrake, “Fast onboard stereo vision for UAVs,” in Vision-based Control and Navigation of Small Lightweight UAV Workshop, International Conference on Intelligent Robots and Systems (IROS), 2015.

[72] H. Oleynikova, D. Honegger, and M. Pollefeys, “Reactive avoidance using embedded stereo vision for MAV flight,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 50–56, IEEE, 2015.

[73] Microsoft, “HoloLens (1st gen) hardware details wiki.” https://docs.microsoft.com/en-us/windows/mixed-reality/hololens-hardware-details, 2018. [Online; accessed March-2019].

[74] Valgrind Developers, “Callgrind: a call-graph generating cache and branch prediction profiler.” http://valgrind.org/docs/manual/cl-manual.html, 2018. [Online; accessed March-2019].

[75] J. Engel and D. Cremers, “LSD-SLAM: Large-Scale Direct Monocular SLAM. Code and Datasets.” https://vision.in.tum.de/research/vslam/lsdslam, 2015. [Online; accessed Mar-2019].

[76] Xilinx, “Zynq-7000 SoC Data Sheet: Overview.” https://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf, 2018. [Online; accessed Jan-2019].