Imperial College of Science, Technology and Medicine
Department of Electrical and Electronic Engineering
Custom hardware architectures for embedded high-performance and low-power SLAM
Konstantinos Boikos
Supervised by Christos-Savvas Bouganis
Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Electrical and Electronic Engineering of Imperial College and
the Diploma of Imperial College, April 2019
Abstract
Simultaneous localisation and mapping (SLAM) is central to many emerging applications such
as autonomous robotics and augmented reality. These require an accurate, information-rich
reconstruction of the environment, which the current state of the art in embedded SLAM,
focused on sparse, feature-based methods, does not provide. SLAM needs to be performed
in real time, with low latency. At the same time, dense SLAM that can provide a high
level of reconstruction quality and completeness comes with high computational and power
requirements, while platforms in the embedded space often come with significant power and
weight constraints.
Towards overcoming this challenge, this thesis presents FPGA-based custom hardware archi-
tectures that offer significantly higher performance than general purpose embedded hardware
for SLAM, but with the same low-power requirements. The work begins by discussing the
characteristics and computational patterns of this type of application, focusing on a state-of-
the-art semi-dense direct SLAM algorithm. Then custom hardware architectures are presented
and evaluated as they emerged from this research work. These combine many novel features to
achieve performance on par with optimised software on a high-end multicore desktop CPU,
but with more than an order of magnitude better performance-per-watt.
The two high-performance, power-efficient architectures, targeting the two interdependent tasks
that comprise the core of real-time SLAM, are designed to work alongside a mobile CPU running
a full operating system, and scale in terms of resources to provide a solution that can be
adapted to most off-the-shelf FPGA-SoCs. Thus, as well as offering the necessary performance
and performance-per-watt to enable advanced semi-dense SLAM on mobile power-constrained
platforms, they stand to bridge the gap between custom hardware and research in algorithms
and robotic vision as they can be adapted and re-used more easily than traditional custom
hardware architectures.
Acknowledgements
Firstly, I would like to thank my supervisor, Dr. Christos Bouganis, for his trust, guidance
and support during these years at Imperial College. Our meetings and discussions allowed me
to develop my abilities and confidence as a researcher, helped me develop my critical thinking
and most importantly helped me get a new perspective all the times that this journey towards
a PhD felt too complicated and overwhelming.
I would also like to thank everyone in the Circuits and Systems group for all the interesting
conversations in the lab, for introducing me to topics I would not have discovered alone and
for their feedback on my work on multiple occasions.
A special thanks to Stelios and Alexandros. Without our conversations, technical and not,
ideas shared, and everything else inside and outside the lab this PhD really would not have
been the same.
I would like to express my gratitude to Nikos, Christos, Rafaella and all the other amazing
people from back home, for always believing in me and supporting me. To Miriam; thank you
for everything these years. There is too much to fit in this page. Without your support and
love I would not be where I am now.
Furthermore, I would like to express my love and gratitude to my grandparents Konstantinos,
Voula and Kalliopi for bringing me up, for all their love and for teaching me all those things
about life that grandparents are always better at knowing. Finally, to my parents Nikos and
Fotini; my deepest love and gratitude for always being there and supporting me, emotionally
and practically, in every way you could, and for your love throughout my life.
Declarations
Declaration of Originality
I herewith certify that the work presented in this thesis is my own original work. All material in this
thesis which is not my own work has been appropriately referenced.
Declaration of Copyright
The copyright of this thesis rests with the author and is made available under a Creative
Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy,
distribute or transmit the thesis on the condition that they attribute it, that they do not use it
for commercial purposes and that they do not alter, transform or build upon it. For any reuse
or redistribution, researchers must make clear to others the licence terms of this work.
‘It is well known that a vital ingredient of success is not knowing that what you’re attempting can’t be done’
Sir Terry Pratchett
Contents
Abstract i
Acknowledgements ii
Declarations v
1 Introduction 1
1.1 Machine Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Terminology and concepts in the field of SLAM . . . . . . . . . . . . . . . . . . 4
1.2.1 Input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 SLAM operation and quality metrics . . . . . . . . . . . . . . . . . . . . 7
1.2.4 Independent variables for SLAM and their effect . . . . . . . . . . . . . . 9
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Aims and Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Research Contributions and Statement of Originality . . . . . . . . . . . . . . . 17
1.7 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Background 19
2.1 A brief history of SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Principles of state of the art SLAM . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Direct semi-dense SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Algorithmic overview of LSD-SLAM . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Tracking in LSD-SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Mapping in LSD-SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7 Proposed Architectures and FPGA-SoCs . . . . . . . . . . . . . . . . . . . . . . 44
2.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3 Accelerating semi-dense SLAM 59
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2 Tracking Algorithm in LSD-SLAM . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2.1 Tracking in LSD-SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2.2 The tracking algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.3 Mapping and Global Optimisation . . . . . . . . . . . . . . . . . . . . . 70
3.3 Profiling and Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.1 Profiling Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.2 Timing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4 Accelerator architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.1 System architecture and control . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.2 Residual and Weight calculation Unit . . . . . . . . . . . . . . . . . . . . 81
3.4.3 Linear System Generation - Jacobian Update Unit . . . . . . . . . . . . . 89
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4 Accelerating Tracking for SLAM 100
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.2 Direct Tracking Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2.3 Frame Cache Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.3.1 Custom Core Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.3.2 Resource Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.3.3 Running as part of a SLAM Pipeline . . . . . . . . . . . . . . . . . . . . 119
4.4 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5 Accelerating Mapping for SLAM 124
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.2 Mapping Algorithm - LSD-SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3 Architecture of the Mapping Accelerator . . . . . . . . . . . . . . . . . . . . . . 128
5.3.1 Coprocessor architecture and FPGA-SoCs . . . . . . . . . . . . . . . . . 128
5.3.2 High-level Architecture Overview and Functionality . . . . . . . . . . . . 131
5.3.3 Multi-rate dataflow operation . . . . . . . . . . . . . . . . . . . . . . . . 141
5.3.4 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.4.2 Benchmark selection and Platforms . . . . . . . . . . . . . . . . . . . . . 148
5.4.3 Design Implementation and Resource Usage . . . . . . . . . . . . . . . . 148
5.4.4 Performance and Power Comparison . . . . . . . . . . . . . . . . . . . . 150
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.5.1 Achievements of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6 Conclusions and Future Work 159
6.1 Lessons learnt designing with HLS and FPGA-SoCs . . . . . . . . . . . . . . . . 160
6.2 Generalisation of the presented research . . . . . . . . . . . . . . . . . . . . . . 163
6.3 Research Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Bibliography 171
List of Tables
2.1 State-of-the-art SLAM examples. Compiled with a focus on the features and char-
acteristics of different solutions, to demonstrate the breadth of the field, with typical
or reported power requirements where available. Comparisons are made at camera
resolutions in the same region of megapixels. . . . . . . . . . . . . . . . . . . . . 58
3.1 Profiling Results - Callgrind / x86 Intel CPU . . . . . . . . . . . . . . . . . . . . 74
3.2 Timing Results - Intel i7 - 4770 @ 3.77 GHz . . . . . . . . . . . . . . . . . . . . 76
3.3 Timing Results - ARM Cortex-A9 @ 667 MHz . . . . . . . . . . . . . . . . . . . 76
3.4 Datasets used, provided on TUM’s website by the authors of LSD-SLAM . . . . 94
3.5 FPGA Resources. The first two columns represent the resources post-synthesis,
where the tool has allocated a certain number of resources to each instantiated
hardware unit. Post-implementation, Vivado uses various optimisations to re-
duce usage by combining circuits or simplifying various units. . . . . . . . . . 94
5.1 Resources post-implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.2 State-of-the-art SLAM examples. Compiled with a focus on features and char-
acteristics of different solutions to demonstrate the breadth of the field. This is
a simpler version of the Table in Chapter 2. . . . . . . . . . . . . . . . . . . . . 155
List of Figures
1.1 SLAM Continuum from Sparse to Dense. A more complex and globally consis-
tent map significantly increases computation requirements. . . . . . . . . . . . . 3
1.2 In a moving platform or camera, the performance of SLAM will directly affect
the accuracy of tracking and the accuracy and quality of the reconstruction, and
in extreme cases lead to tracking loss. The map on the right is on a system
that can deliver a 4× improvement in performance compared to the left. The
map on the left has accumulated a very large error from skipping frames due to
performance constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Even if the slower system can keep up, improved performance is crucial to im-
prove quality and accuracy under fast movement. The map on the left was
recovered on a system with a performance deficit of 2× compared to the one on
the right . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 SLAM algorithmic overview. The main tasks are inside the two dashed rectan-
gles. The main data structures are indicated with orange coloured boxes, while
the light blue boxes represent the main tasks necessary to perform tracking and
mapping. With grey we indicate some of the background tasks involved in full
SLAM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Figure adapted from [1] to demonstrate tracking with direct Keyframe alignment
on sim(3) utilizing the estimated depth map. We can see from two camera views
the current state of the map on the left as a collection of inverse depth and depth
variance values for the mapped points and the photometric residual on the top
right as a result of the reprojection. . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Pyramid processing. Starting at a lower, coarser resolution increases the radius
of convergence for the optimisation and improves the speed of reaching a minimum 40
2.4 Epipolar geometry, the epipolar line is depicted with orange colour . . . . . . . . 42
2.5 From [1], the top row is different camera frames overlaid with the estimated
semi-dense inverse depth map. The bottom row contains the camera view as a
pyramid with blue edges, with its associated trajectory as a line in front of a 3D
view of the tracked scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Concept heterogeneous architecture block diagram . . . . . . . . . . . . . . . . . 45
2.7 FPGA architecture. Dedicated SRAM memories and DSP blocks with capable
hardened multipliers have significantly improved the efficiency of frequently used
and traditionally costly operations . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8 A modern FPGA fabric is organised in clock domains made up of slices with the
edges of the silicon usually housing communication circuits and ports. . . . . . 47
2.9 Zynq 7-series FPGA-SoC and Interconnect. Only connections relevant to the
architectures researched in this thesis are included for clarity of presentation.
Source: Zynq-7000 TRM [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 Control flow of tracking task. The three main sub-functions are described inside
the dashed lines. The rest of the computation and control described in the figure
happens outside these functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2 SLAM algorithmic overview. The main tasks are inside the two dashed rectan-
gles. The main data structures are indicated with orange coloured boxes, while
the light blue boxes represent the main tasks necessary to perform tracking and
mapping. With grey we indicate some of the background tasks involved in full
SLAM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3 Percentage of total computation for the tasks comprising SLAM . . . . . . . . . 74
3.4 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5 Accelerated tracking and mapping execution in software . . . . . . . . . . . . . 79
3.6 Intensity value interpolation for a projected point using its 4 neighbouring pixels 82
3.7 Residual and Weight Calculation Unit . . . . . . . . . . . . . . . . . . . . . . . 82
3.8 Tracking involves projecting a map point, with a recovered inverse depth 1/z to
the image plane in the current camera frame . . . . . . . . . . . . . . . . . . . . 83
3.9 Residual Calculation Pipeline - Pixel Re-projection . . . . . . . . . . . . . . . . 84
3.10 Pixel Gradient and Intensity Interpolation . . . . . . . . . . . . . . . . . . . . . 88
3.11 Element Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.12 Different scenes will generate Keyframes with fewer or more mapped points,
which affects the runtime both in Hardware and in Software . . . . . . . . . . . 92
3.13 Average performance in frames per second for two different sequences selected
from the datasets in Table 3.4. The grey colour corresponds to a more dense,
complex scene where a higher number of points are mapped and used for tracking.
Examples of two scenes that would generate maps with a different number of
points are in Fig.3.12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.14 Memory transfer cost in microseconds to synchronise the map and frame with
the hardware buffers on the DRAM. This happens once per pyramid level. . . . 95
3.15 Software and hardware timing for individual functions. These are run multiple
times until the error stops decreasing significantly, at which point the process
is repeated for coarse-to-fine pyramid levels until the penultimate level (once
subsampled from the original image) is reached . . . . . . . . . . . . . . . . . . 96
3.16 In a moving platform or camera, the performance of SLAM will directly affect
the accuracy of tracking and the accuracy and quality of the reconstruction, and
in extreme cases lead to tracking loss. The map on the right is on a system
that can deliver a 4× improvement in performance compared to the left. The
map on the left has accumulated a very large error from skipping frames due to
performance constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.17 Even if the slower system can keep up, improved performance is crucial to im-
prove quality and accuracy under fast movement. The map on the left was
recovered on a system with a performance deficit of 2× compared to the one on
the right . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Tracking Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.3 Frame Cache Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4 Frame Cache Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.5 Processing Time Per Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.6 Total Time per Level including Memory Copy . . . . . . . . . . . . . . . . . . . 118
4.7 Performance comparison of this accelerator with our previous work presented in
Chapter 3 and with the NEON-accelerated software on the ARM Cortex-A9 as
baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.8 Performance/Resource Scaling targeting a larger FPGA . . . . . . . . . . . . . . 120
5.1 Zynq 7-series FPGA-SoC and Interconnect. Only connections relevant to the
architectures researched in this thesis are included for clarity of presentation. . . 130
5.2 Block diagram of the accelerator architecture . . . . . . . . . . . . . . . . . . . . 132
5.3 Sliding window over current keyframe . . . . . . . . . . . . . . . . . . . . . . . . 134
5.4 Sliding window utilizes shift registers and 4 row buffers . . . . . . . . . . . . . . 134
5.5 The intensity gradient is calculated in the two image directions, for the target
pixel and its four immediate neighbours, resulting in a total of 13 accesses, 20
gradient calculations and a final reduce operation to find the maximum value in
the region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.6 Epipolar geometry, the epipolar line is depicted with orange colour. While the
point will lie on the line, it does not have to appear on the camera’s frame as it
can lie outside that plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.7 For each comparison, the previous four values are re-used and a new one is
calculated by interpolating the values of the four pixels surrounding the floating
point coordinates of the next scan point . . . . . . . . . . . . . . . . . . . . . . 139
5.8 In this case, intensity information, combined with a large baseline in the absence of
a strong previous estimate, is insufficient to provide a good match. . . . . . . . . 140
5.9 The units in the fast-rate pipeline operate at two rates simultaneously, with some
control processes, initialization and communication with the rest of the pipeline
happening only when starting or finishing a scan, and the main compute
units operating at a rate of one scan step per cycle. . . . . . . . . . . . . . . 143
5.10 Heatmap of Depth Map valid points for epipolar scan. Axes represent image
coordinates, with colour representing the frequency of a Keypoint in those coor-
dinates requiring a scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.11 Resource scaling with architectural tuning targeting 100 MHz . . . . . . . . . . 150
5.12 Mapping performance in msecs - Different Platforms / Datasets . . . . . . . . . 152
5.13 Power consumption of the devices tested. Here, “This work” refers to the com-
bined power of tracking and mapping accelerators and the CPU operating for
the background tasks of SLAM. . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Chapter 1
Introduction
1.1 Machine Vision
Building machines with the ability to see and understand the space around them has interested
researchers for decades. At the same time, it is a complex task that, while appearing effortless
for humans, has proven very hard to tackle. In recent years, there has been an explosion
of interest and research effort surrounding intelligent machines and systems. One area of
particular interest is the push towards fully autonomous machines that can move and interact
in an unknown environment. A self-driving car, for example, needs continuous awareness of
the environment around it, both static (road surface, lane markings) and dynamic (e.g. other
vehicles) to be able to operate safely and autonomously. Lately, there has been significant
research and progress on this front, utilizing advances in algorithms, sensors and computing
platforms to achieve important milestones towards a system that can continuously generate a
map of what is around it and track itself in that environment.
This type of system is a necessary component of many emerging applications. One example is
household assistance robots, e.g. [3, 4], that could finally gain the capability to handle complex
objects that are articulated or deform and to navigate autonomously in a challenging and
dynamic three dimensional environment. A similar line of research is leading to environment-
aware industrial robots that can operate alongside human operators in a much safer way, even
in co-operative situations. Smaller flying machines are gaining the capability for autonomous
indoor and outdoor flight and exploration, both in quadcopter form [5] and as a fixed-wing
aircraft [6]. These can provide the capability for rapid exploration of large spaces using swarms
[7], much more effective search and rescue operations [8] and precision agriculture used in
conjunction with other vision techniques [9]. Lately, we have witnessed a very strong push
towards fully autonomous cars from both private companies, and academic efforts [10]. Finally,
the field of autonomous robots is closely linked with augmented reality applications which have
to continuously track a user’s location, orientation and movement at a very low latency and
maintain an accurate map of the user’s surroundings.
One of the core elements in this effort is a family of algorithms and systems called Simultaneous
Localisation and Mapping (SLAM), which aims to provide a solution to the problem of exploring
an unknown environment while continuously keeping track of the system’s own position. SLAM
has to integrate a series of observations of an environment from different sensors, using its own
estimated positions, to incrementally build and maintain a map of that environment. The
definition of SLAM has been used in the past to describe many efforts, including slow and
progressive structure-from-motion methods and 2D mapping utilizing sensors such as sonar and
lidar. However, the forefront of this field, and especially of work on autonomous systems that
have to operate in a complex, dynamic environment, concerns systems utilizing a visual
sensor and operating on a moving platform. Thus, this thesis will concentrate on real-time
visual SLAM, which refers to performing all processing live, at the camera’s rate of operation
or framerate.
SLAM in the literature usually comprises two main tasks. Localisation, often referred to as
tracking, is the act of continuously estimating the 6-degree-of-freedom position and orientation
(pose), of the camera. Mapping is the task of generating and continuously updating a coherent
model of the environment based on the sensor observations. These two tasks are very closely
interconnected and strongly dependent on each other. Tracking closes the loop with the mapping
task by comparing the incoming data from the sensor with the map that has been generated
to estimate a current pose. Then, the accuracy of that estimation will determine the quality of
the next map update and how close it will be to reality.
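This mutual dependence can be made concrete with a purely illustrative sketch. The types and update rules below are hypothetical one-dimensional stand-ins, not taken from any system discussed in this thesis; real SLAM uses 6-DoF poses and 3D maps. The point is only the data flow: tracking reads the map to produce a pose, and mapping consumes that pose to update the map.

```cpp
#include <cassert>
#include <vector>

// Illustrative 1-D stand-in for the tracking/mapping loop. The "pose" is
// modelled as a scalar offset and the "map" as a list of fused landmarks.
double track(double observation, const std::vector<double>& map) {
    // Tracking compares incoming sensor data against the existing map to
    // estimate the current pose (here: offset from the latest landmark).
    return map.empty() ? 0.0 : observation - map.back();
}

void updateMap(std::vector<double>& map, double observation, double pose) {
    // Mapping uses the estimated pose to fuse the observation into the
    // map; an inaccurate pose estimate corrupts this update.
    map.push_back(observation - pose);
}
```

Even in this toy form, the coupling is visible: any error in the pose returned by `track` is written directly into the map by `updateMap`, and the degraded map in turn worsens the next pose estimate.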
Figure 1.1: SLAM Continuum from Sparse to Dense. A more complex and globally consistent map significantly increases computation requirements. Sources: ORB-SLAM (R. Mur-Artal), LSD-SLAM (J. Engel et al.), ElasticFusion (T. Whelan et al.).
Towards addressing the challenges of real-time visual SLAM, a number of solutions have
emerged and the field has gradually generated different approaches, each with their own ad-
vantages and disadvantages. A main categorisation is in terms of map density, describing how
many observations the algorithm deals with. The categories that emerge from this are Sparse,
Dense and lately Semi-dense SLAM. Sparse SLAM uses a small set of observations for Track-
ing and maintains a sparse map of the environment consisting of some observed 3D points of
interest. These approaches exhibit lower computational requirements for a similar accuracy
in estimating the position of the camera in the environment, but are limited in the density
and “richness” of the environment’s reconstruction. At the other end of the spectrum, SLAM
algorithms categorised as Dense are now able to construct a complete high quality model of the
environment usually as interconnected surfaces. At the same time they are significantly more
computationally intensive.
To address this drawback a family of works described as Semi-dense SLAM have emerged.
These aim to provide a more dense and information-rich representation compared to sparse
methods, while achieving better computational efficiency through processing a subset of high
quality observations and attempting to reconstruct as complete a model of the environment as
possible from these. However, they are still computationally complex and target desktop-grade
multicore CPUs for real-time processing. In Fig.1.1 we can see three state-of-the-art examples
positioned in this continuum of sparse to dense.
At the same time, in addition to the high computational requirements and the low latency
necessary for emerging applications, many platforms come with constraints on the power and
weight they can support. Quadcopters, other UAVs and ground robots impose significant
constraints on both power and weight while other applications that require some form of SLAM
such as augmented reality face even stricter constraints. This has currently resulted in a large
gap between research in SLAM algorithms and SLAM implementations on embedded
platforms. A main cause of this is the large gap in computational capabilities between high-
end GPUs and desktop CPUs and those found in low-power mobile devices or embedded SoCs.
Finally, it is important to note that in the state of the art many dense methods utilize specialised
sensors that can recover an estimate of the depth directly from the visible scenes. The main
example is Kinect-type sensors, projecting an infrared pattern that, together with a special
infrared camera and ASIC combination, give a depth estimate for all objects up to a few
meters away from the camera. This type of sensor, which we will discuss further in Chapter 2,
has enabled very high quality results, but requires higher power consumption, is heavier and
it is constrained to depths of a few meters, making it unsuitable for outdoor spaces or large
environments. For this work, low power and weight characteristics are an essential target, to
enable the emerging applications discussed. Additionally, many robotic platforms have to be
able to operate in large or outdoor spaces. Thus, we will focus on works that utilize passive,
monocular cameras, since they are a power efficient and lightweight solution, and can work well
in a variety of environments.
1.2 Terminology and concepts in the field of SLAM
This section will outline and establish the main concepts and terminology that are used through-
out the literature of SLAM and this thesis.
1.2.1 Input data
First we shall define the form of data that a SLAM algorithm deals with, when applied in the
context of a SLAM system (e.g. a robot with a set of sensors for localisation and mapping). The
input of a SLAM algorithm is a series of measurements of the environment in which it is applied,
optionally combined with a secondary localisation system using a series of measurements of the
system’s position, rotation or acceleration. The precise form of environment measurements will
vary for different solutions depending on the sensors used, but it will be one or a combination
of the following:
– A series of images from a camera capturing light intensity and optionally including colour
information. When referring to images captured from a camera in the context of SLAM
as part of a sequence, the word Frame is frequently used to describe one of the captured
images in the sequence.
– A series of depth measurements in an area around the system using depth-sensing
sensors such as a sonar or time-of-flight sensor.
– A series of images (or frames) captured from an RGB-D sensor combining both of the
above in an integrated image with depth estimates for all or part of the field of view.
For a SLAM algorithm the above input is used as a sequence of observations. Focusing on
a sequence of images captured from a camera, which is the main input in most of the works
discussed in this thesis, each image is characterised by the following:
– Pixels. Each pixel (originally picture element) is the smallest addressable element of a
digital image sensor, providing a light intensity I (and optionally colour) measurement
for a part of the field of view. There are hundreds of thousands to millions of pixels
composing an image from a modern digital image sensor.
– Height and width, specified as the number of pixels in each dimension.
– Resolution, defined as the total number of pixels per image and obtained from the
product of height and width.
When referring to image pixels in a captured image we shall use their position in terms of
vertical and horizontal dimensions, ranging from (0, 0) at the upper left corner of the sensor
array to (width − 1, height − 1) at the bottom right. Zero-based numbering such as this is
frequently used, since it directly corresponds to an offset from an initial pixel position. As such,
it matches the forms of addressing usually employed in such systems.
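This zero-based, row-major convention maps each pixel coordinate directly to a linear memory offset. The following is a minimal sketch (the function name is illustrative):

```python
def pixel_offset(x, y, width):
    """Linear offset of pixel (x, y) in a row-major image buffer.

    With zero-based numbering, (0, 0) is the upper-left pixel, so the
    offset is simply the number of pixels preceding (x, y) in memory.
    """
    return y * width + x

# For a 640x480 image, the bottom-right pixel (width - 1, height - 1)
# maps to offset resolution - 1:
assert pixel_offset(639, 479, 640) == 640 * 480 - 1
```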
1.2.2 Output
For an input of a series of frames, the output of a SLAM algorithm consists of a generated
map of the environment around the system and an estimate of the sensor’s (and by extension
the system’s) pose.
The recovered pose produced by the SLAM system is an estimate of the position and orientation
of the system, at each captured frame, in relation to the origin point in a process known as
tracking. To recover the pose, the information in the input (a captured frame) needs to be
compared to and matched with known points of the environment and optimised against them.
These known points of reference are part of the generated map of the system.
A map for the SLAM system is a representation of a generated model of the environment.
It differs between SLAM systems, as we will discuss more in the following two chapters, but
for most algorithms it is composed of a set of points, where a point is an observed part of the
environment in terms of a single pixel or a set of neighbouring pixels. The actual size of a
“point” in the real world is variable and depends on camera parameters such as focal length
and the distance from the object observed.
A point is composed of its position coordinates in three-dimensional space (x, y, z) in relation
to an origin point (0, 0, 0) and other identifying information. This information may be just the
light intensity and colour of that part of the environment, but could include other descriptors.
For example, in state-of-the-art feature-based SLAM, a pixel’s intensity is compared with its
neighbours in a small area. The results of those comparisons are stored as a set of binary digits
to describe that local pattern. The origin point is conventionally chosen to be the position of
the system at the time of the first captured frame.
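As an illustration of such a descriptor, the sketch below compares the intensities of pixel pairs around a point and packs the outcomes into a bit string, in the spirit of the binary descriptors described above. The sampling pattern and function names are hypothetical, not those of a specific published method:

```python
def binary_descriptor(image, cx, cy, pairs):
    """Describe the local pattern around (cx, cy) as a string of bits.

    Each (dx1, dy1, dx2, dy2) entry compares the intensities of two
    pixels near the centre; each comparison contributes one bit.
    `pairs` is an illustrative sampling pattern.
    """
    bits = 0
    for dx1, dy1, dx2, dy2 in pairs:
        a = image[cy + dy1][cx + dx1]
        b = image[cy + dy2][cx + dx2]
        bits = (bits << 1) | (1 if a < b else 0)
    return bits

def hamming(d1, d2):
    """Matching two such descriptors reduces to their Hamming distance."""
    return bin(d1 ^ d2).count("1")
```

Matching a point across different views then amounts to finding the candidate with the smallest Hamming distance, which is far cheaper to compute than comparing floating-point descriptors.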
The map is generated by extracting information from the sequence of captured images. Algo-
rithms that attempt to generate a more complete reconstruction of the environment use surface
elements (surfels), volume elements (voxels) or other forms of interconnected surfaces. These
different representations are an active research area in an effort to generate an efficient but
complete model of the environment around a system.
In SLAM algorithms, the more complete the reconstruction, the less information is stored per
element. SLAM algorithms that require only a few hundred or thousand points in three-dimensional
space to compose a map frequently employ complex descriptors for a small area around each
point to guarantee accuracy when matching the same point across different views. On the other
hand, SLAM algorithms that attempt a full surface reconstruction rely exclusively on the light
intensity or colour of the recovered surfaces and track against most of the visible portions of
the map at the same time.
1.2.3 SLAM operation and quality metrics
In a typical SLAM system, processing always begins with the flow of information at the input.
The system captures a frame from the camera and begins matching and comparing the captured
information with the model of the environment that has been generated (the map). This
matching and comparison forms part of an optimisation process towards estimating the pose
of the camera that minimizes the error between the model and the actual observations.
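As a minimal, hypothetical illustration of this objective, the sketch below sums squared photometric errors between the intensities stored with map points and the intensities observed at their projections under a candidate pose. Rotation is omitted for brevity and a pinhole camera is assumed; this is a simplified example, not the formulation of any specific algorithm:

```python
def tracking_cost(translation, map_points, image, fx, fy, cx, cy):
    """Sum of squared photometric errors for a candidate camera pose.

    Each map point carries a 3D position and a stored intensity. It is
    moved by a candidate translation (tx, ty, tz), projected with a
    pinhole model (focal lengths fx, fy; principal point cx, cy), and
    compared with the intensity observed in the current frame. A
    tracker would minimise this cost over the full 6-DoF pose.
    """
    tx, ty, tz = translation
    cost = 0.0
    for (x, y, z), intensity in map_points:
        px, py, pz = x + tx, y + ty, z + tz
        if pz <= 0:
            continue  # point behind the camera: not observable
        u, v = int(fx * px / pz + cx), int(fy * py / pz + cy)
        if 0 <= v < len(image) and 0 <= u < len(image[0]):
            cost += (image[v][u] - intensity) ** 2
    return cost
```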
This pose estimate at the end of the optimisation process is then used as the input of a mapping
process that attempts to integrate the newly captured information to improve the accuracy of
the current map and add new pieces of information to it. As such, the main quality metrics
used for SLAM have to do either with accuracy or performance. Regarding accuracy, these
metrics are either about the accuracy of the successive pose estimates (one for each captured
frame) or with the accuracy of the generated map of the environment.
The accuracy of the pose is quantified as the distance (error) between the real-world three-dimensional
position of the sensor in relation to the origin point and the estimated position of the
recovered pose. To capture the behaviour of the algorithm across a dataset, one of the most
frequently used quality metrics is the root mean squared error of the poses that comprise the
entire captured trajectory. It gives a good estimate of the average behaviour of an algorithm
while also penalising large errors even if they are encountered on fewer observations. Other
metrics can be used around the accuracy of the pose estimates as well, such as the rotational
error (error in the orientation estimates of the camera in terms of degrees) or loop closure
errors (the distance between the estimate and the real-world position when returning to the
same world point).
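The trajectory RMSE described above can be computed as in the following sketch, assuming the estimated and ground-truth positions are already expressed in a common frame of reference (trajectory alignment is omitted):

```python
import math

def trajectory_rmse(estimated, ground_truth):
    """Root mean squared error over a trajectory of position estimates.

    Both arguments are equal-length sequences of (x, y, z) positions,
    one per frame. Squaring penalises large errors more heavily, even
    when they are encountered on only a few observations.
    """
    assert len(estimated) == len(ground_truth) and estimated
    squared = [
        (ex - gx) ** 2 + (ey - gy) ** 2 + (ez - gz) ** 2
        for (ex, ey, ez), (gx, gy, gz) in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))
```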
As opposed to tracking accuracy, the quality of reconstruction has been less frequently used in
published literature to compare competing algorithms. However, this is gradually starting to
change in recent works. It is usually discussed qualitatively, as opposed to using a quantitative
metric, even in state of the art methods, in terms of the completeness of the map and the
amount of visible distortion in comparison to the actual environment captured.
The performance metrics for a SLAM algorithm have to do with latency and throughput.
Latency, defined as the time between the arrival of new input data and output data being
ready, is important in fast and agile platforms such as quadcopters or other autonomous aircraft
where the pose and map may be used online for tasks such as obstacle avoidance. Relevant
latency measurements are:
– From the moment of capturing a frame until the estimated pose for that frame is recovered.
– From the moment of capturing a frame until the information in that frame and the
recovered pose estimate have been incorporated into the map.
The choice of which one to use depends on how the SLAM output is used in the context of a
latency critical application.
Throughput, in terms of frames per second processed, is more frequently used as it will
directly dictate the speed of robust operation and indirectly affect the quality of both pose
estimation (tracking) and mapping. This has mainly to do with the fact that SLAM operates
in unknown or partially recovered environments and requires a relatively small amount of
rotation and translation between successive frames for robust operation. A high throughput
will satisfy this requirement even for faster and more agile moving platforms. Throughput is
sometimes directly referred to as framerate since it is most often measured in terms of frames
processed per second.
1.2.4 Independent variables for SLAM and their effect
Depending on the specific algorithm implemented, there are a significant number of variables
that can be tuned to affect the quality and performance of that algorithm in different situations.
This subsection will summarise and describe the most common and important ones.
Variables relating to Map density
Many SLAM algorithms, including semi-dense SLAM, will not attempt to reconstruct 100% of
their environment but will instead recover portions of the environment in terms of a collection
of “mapped points”. In these algorithms there are various thresholds that affect the number
of points that the algorithm attempts to recover as part of the map. This also affects the
complexity of the tracking task, as a larger number of recovered map points means a larger
amount of points to match and compare for pose estimation.
Increasing the map density can increase the accuracy of tracking and mapping up to a certain
level, but then a point of diminishing returns is reached. It can also enhance the quality of the
map, and in some cases a comparatively high density is required for applications in which a
higher degree of environment awareness is important. However, it will also increase the latency
of the tracking and mapping tasks as the amount of data to process in both tasks increases.
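As an illustration of such a threshold, semi-dense methods typically select only well-textured pixels, for example those whose image gradient magnitude exceeds a tunable value. The sketch below is illustrative, not the exact criterion of any specific algorithm; raising the threshold yields a sparser, cheaper map, while lowering it yields a denser, more expensive one:

```python
def select_candidate_pixels(image, grad_threshold):
    """Select pixels whose gradient magnitude exceeds a threshold.

    Gradients are computed with central differences; border pixels are
    skipped. The returned (x, y) positions are the candidates that a
    semi-dense mapper would attempt to recover depth for.
    """
    h, w = len(image), len(image[0])
    selected = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            if gx * gx + gy * gy >= grad_threshold ** 2:
                selected.append((x, y))
    return selected
```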
Image resolution
The resolution of the captured images from the camera sensor will dictate the smallest observable
part of the environment and the number of observations in each frame. As such, a
higher resolution stands to improve the algorithm’s accuracy (again until a point of diminish-
ing returns) but will increase the number of elements that need to be processed and directly
increase the latency of the tracking and mapping tasks. It should be noted that the frame
resolution used for tracking is not necessarily the same as that used for mapping, since the
image can be downscaled independently for either task after being captured at the camera.
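As a minimal illustration of preparing a lower resolution for one of the two tasks, the sketch below halves each dimension by averaging 2×2 pixel blocks. Real pipelines may filter differently, but the effect on the workload is the same: 4× fewer pixels to process per frame:

```python
def downscale_2x(image):
    """Halve the resolution of an image by averaging 2x2 pixel blocks."""
    h, w = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]
```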
Variables relating to noise modelling
SLAM algorithms are probabilistic, dealing with imprecise sensor measurements and generated
estimates. As such, they correlate different observations to different degrees, and most will
attempt to model sensor and estimation noise. Depending on the algorithm and its
implementation, there will be different independent variables relating to noise modelling that
can be tuned to improve the quality of the output for different environments and sensors.
1.3 Motivation
This section will discuss the motivation for the work presented in this thesis. Due to power con-
straints, most embedded visual SLAM implementations focus on sparse SLAM that is adapted
towards reducing computational requirements. One main approach is to trade information
richness and robustness for performance by using a lower sensor resolution, tracking a sparser set
of points instead of surfaces or objects, and avoiding denser or large-scale mapping and map
consistency altogether. These solutions are either designed from the ground up to be
computationally efficient, for example [11] and [12], or are reduced versions of existing approaches.
Another approach to embedded sparse SLAM has been to design a lightweight but accurate
visual odometry algorithm that can achieve real-time performance on-board an embedded
device, with the option of offloading computation to a remote server that reconstructs a dense
map [13, 14]. This comes with increased power consumption for the wireless communications,
as well as increased latency. It also comes with a reduced area of operation and very
high bandwidth requirements, and is not a good fit for applications requiring awareness of a
dense and complex 3D scene.
Moreover, this trend towards more accurate and advanced but computationally demanding
SLAM algorithms seems to be continuing at a faster pace than advances in computing platforms'
raw performance. The examples of emerging applications in the embedded space introduced
previously require the level of quality only state-of-the-art, large-scale dense or semi-dense
SLAM can provide. These systems require a high level of understanding of their environment
that sparse SLAM inherently cannot provide.
At the same time, due to safety and robustness requirements, there is a need for very low
processing latency in many platforms. In most state-of-the-art SLAM algorithms there is
an underlying assumption that there is a small amount of translation and rotation between
successive processed frames. This means that depending on the movement speed of the camera,
a minimum level of performance for a certain platform is necessary to avoid very large errors
and potential failure in tracking due to fast movement. However, further improvements past
that minimum are important as they can provide a larger improvement in quality and accuracy
for the same algorithm. Figures 1.2 and 1.3, discussed further in Chapter 3, demonstrate the
impact of reduced performance during live SLAM, which causes frames from the camera to be
dropped instead of processed. In Fig. 1.3, a 2× reduction in performance on the left side of the
figure compared to the system on the right creates a large accumulated trajectory error and a
visible drop in quality and detail for the recovered map surfaces.
Meanwhile, in addition to the performance requirements and complexity of SLAM, most embed-
ded robotics and augmented reality applications have significant power and weight constraints.
These specifications rule out most of the conventional hardware that can perform cutting-edge
SLAM in real time in the embedded space. This level of computational demand, which software
optimisation alone cannot meet, can instead be met by new hardware platforms. Towards closing this gap
Figure 1.2: (Left panel: "Skipping frames due to performance constraints". Right panel: "Adequate performance for the movement speed".) In a moving platform or camera, the performance of SLAM will directly affect the accuracy of tracking and the accuracy and quality of the reconstruction, and in extreme cases lead to tracking loss. The map on the right is from a system that can deliver a 4× improvement in performance compared to the left. The map on the left has accumulated a very large error from skipping frames due to performance constraints.
Figure 1.3: (Annotations: "Higher performance leads to more information processed"; "Reduced drift and error accumulation for pose and map".) Even if the slower system can keep up, improved performance is crucial to improve quality and accuracy under fast movement. The map on the left was recovered on a system with a performance deficit of 2× compared to the one on the right.
between performance and efficiency for embedded SLAM, a custom domain specific hardware
architecture can provide a significant improvement both in terms of absolute performance, and
performance-per-watt compared to conventional general purpose processors. General purpose
hardware has to be fast and economical for the average case of software, and support all
combinations of an instruction set. Modern high-performance CPUs can exploit some
instruction-level parallelism, concurrently decoding and executing several arithmetic
instructions (the "width" of a CPU's microarchitecture), and different
parallel tasks can execute on multiple cores. However, their flexibility in supporting any
software task means they will necessarily have a lower efficiency for special cases than a custom
solution.
Meanwhile, GPUs have seen significantly increased use as accelerators, often referred to as single
instruction, multiple threads (SIMT) processors. At the time of writing, standard software
languages such as C++ can be used to compile and execute code on the processing elements
in a GPU in a relatively straightforward fashion. They offer much more parallelism than
CPUs with their simpler but very wide vector instruction units, but they are still throughput-optimised
many-core architectures targeting data-level parallelism. Their efficiency is thus
limited to certain types of algorithms that fit this model well and they usually have a much
higher latency to process a single thread than a traditional CPU.
Algorithms such as direct tracking for SLAM do not fall neatly in either case. They do come with
some instruction-level parallelism but are still computationally intensive for embedded CPUs,
and do not fit the data-level parallelism requirement for mapping on a GPU or multiple cores due
to data dependencies and non-homogeneous computations between iterations with complex
control flow. Moreover, their data-intensive nature would stress either of these platforms, both
in terms of data movement and random access patterns. The advantage of custom hardware is
twofold. First, by precisely controlling resource allocation, a design will only have the processing
elements necessary for the task at hand. This means better-utilized area, with fewer resources
taking up space and power. Second, by designing and optimising the hardware architecture to
match the needs of a specific application its power efficiency and achievable performance can
be improved by a large margin.
At the same time, hardware realised as a custom integrated circuit, referred to as Application
Specific Integrated Circuit (ASIC), has a high initial cost of design and production. As such,
ASICs are usually produced when that cost is amortised by large production volumes, enabled
by the amount of re-usability and standardisation of an algorithm. Between these platforms
sits reconfigurable hardware, such as Field Programmable Gate Arrays (FPGAs).
FPGAs still allow most of the customisability of designing an ASIC, since they can emulate most
digital designs composed of digital gates and SRAM memory banks. At the same time, since
they are reconfigurable, they have a much smaller implementation cost: an update in the field
requires only a change in the HDL (hardware description language) code describing the design,
followed by a synthesis and place-and-route step taking a matter of hours, whereas issuing a new
version of an ASIC chip costs almost as much as designing it from scratch. In this way, by
utilizing this platform one trades a 7-20× reduction in performance and a 10× penalty in power
and area compared against an optimised CMOS implementation [15, 16], but significantly
reduces the costs and timeline of development. Thus, a custom hardware design can be realised
and deployed faster and at a lower cost.
Crucially, even with this disadvantage compared to ASICs, FPGAs are still more power efficient
and achieve a higher performance-per-watt than general purpose architectures for certain
algorithms. Moreover, SLAM is currently a volatile field, with the methods and algorithms changing significantly
every few years. In light of this, taking advantage of the lower time for redesign on FPGAs,
the advantages of specialised hardware can be realised now on off-the-shelf devices, and the
hardware can then be updated to keep up with the software better than a specialised ASIC
could, and at a lower cost. Finally, if the algorithms become more standardised, a well-defined,
generalisable architecture can carry over most of its principles and conclusions and be
re-implemented on an ASIC to maximise the gains of custom hardware.
To conclude, by using off-the-shelf reconfigurable logic to develop custom, specialised hardware
for SLAM, a significant performance-per-watt advantage can be gained compared to general
purpose hardware. This can enable the use of advanced state-of-the-art SLAM algorithms
on low-power embedded platforms, bridging the gap between the current state-of-the-art in
embedded vision and the latest trends in SLAM algorithms and providing a solution towards
achieving advanced capabilities for emerging applications. As we shall demonstrate in this
thesis, the accelerators that emerged from our research work fully addressed the computationally
demanding tasks in SLAM and offer real-time high-performance SLAM (more than 60 frames
per second) at a power-level an order of magnitude smaller compared to the general purpose
hardware widely used for state-of-the-art SLAM algorithms.
1.4 Research Question
In recent years many emerging applications have appeared in the fields of autonomous robotics
and embedded systems that require advanced computer vision capabilities. The high per-
formance requirements of algorithms that can provide such capabilities, combined with the
resource constraints of most embedded platforms, led to the research question underlying this
thesis:
“Is it possible to design a custom hardware architecture for embedded platforms that can bring
advanced SLAM capabilities with high performance together with low-power characteristics and
how should such an architecture be designed?”
Towards answering this question, we have focused on the following criteria. Firstly, such an
architecture has to be applicable to state-of-the-art algorithms. SLAM is a field that is evolv-
ing fast, and focusing on older, simpler algorithms would diminish the impact of a custom
hardware solution. Moreover, it has to achieve a significant improvement in performance per
watt compared to current embedded platforms for robotics, while maintaining or improving
on their power requirements for continuous operation. Finally, to have a high impact in the
field of embedded SLAM, it has to maintain real-time operation, being able to process at least
30 frames per second even in complicated environments, while if possible maintaining a low
latency from input to output.
1.5 Aims and Thesis Overview
In the work described in this thesis, we propose two specialised, domain-specific architectures
based on FPGA-SoCs to match the unique demands of these algorithms, and bring high per-
formance for state-of-the-art SLAM on an embedded power budget. The aim of this thesis is
to first present an overview of state-of-the-art visual SLAM with a focus on semi-dense methods,
which offer a more information-rich output than traditional sparse approaches in robotics.
Subsequently we will discuss in depth the performance and computation patterns of a state-
of-the-art example, LSD-SLAM [17], and proceed to present our research in overcoming the
challenges described above with designs based on FPGA-SoCs. This will lead to two high
performance dataflow designs that accelerate both the tasks of tracking and mapping for semi-
dense direct SLAM. These achieve a high-end desktop CPU level of performance at an order
of magnitude lower power consumption. Moreover, they are designed to be scalable, easily
expandable and modifiable, and can be implemented on off-the-shelf FPGA-SoC chips.
Chapter 2 presents the necessary background and details for the algorithms discussed, utilized
and accelerated in this thesis, as well as the hardware platforms, tools and design principles
that are used as part of this work. Chapter 3 will present our profiling and characterisation
results for LSD-SLAM and a first straightforward deeply pipelined solution. Some first domain
specific optimisations are discussed, as well as the functionality of the tracking task in this
algorithm, and finally the resulting bottlenecks and challenges that were encountered and led
to the high-performance solutions of the next two chapters.
Chapter 4 will then discuss a novel, highly-optimised dataflow based approach that achieves
desktop-grade performance at an embedded grade power envelope for the task of tracking.
Taking advantage of the lessons learned in our previous research, the design presented in that
chapter focuses on providing an efficient, high-performance solution for Direct Tracking using
a high-bandwidth streaming architecture, optimised for maximum memory throughput.
Chapter 5 will present the design of a scalable and high performance, power efficient specialised
accelerator architecture targeting mapping in the context of a semi-dense SLAM algorithm.
Finally, Chapter 6 will discuss our overall conclusions and contributions, and our view on the
future of this work and the field in general.
1.6 Research Contributions and Statement of Originality
This thesis presents the following original contributions on achieving high performance, state-
of-the-art SLAM on embedded low-power platforms. These contributions have affirmatively
answered the question stated in Section 1.4, covering all of our initial requirements.
• The characterisation and analysis of a state-of-the-art semi-dense SLAM algorithm, LSD-
SLAM. The results obtained guided the research of the domain-specific accelerators pre-
sented in this thesis, and gave a detailed overview of the performance requirements and
characteristics of this type of algorithm.
• The first, to the best of our knowledge, design for a high performance, power efficient
accelerator architecture for direct photometric tracking for SLAM on an FPGA-SoC.
• An investigation on factors that affect the performance of this type of accelerator, the peak
performance that can be achieved by an off-the-shelf FPGA-SoC and bottlenecks that can
arise when used as a coprocessor next to a mobile CPU for this type of application.
• The design of a scalable and high performance, power efficient specialised accelerator
architecture, that can process and update a map in less than 20ms, the average latency
between two camera frames in the context of semi-dense SLAM in state of the art robotics
applications.
• Lastly, a system-on-chip that can be combined with the work presented in Chapter 4
to form the first, to the best of this author’s knowledge, complete SLAM accelerator on
FPGAs targeting semi-dense SLAM, pushing the state of the art in performance and
quality for dense SLAM on low-power embedded devices.
These represent the author’s own work during his studentship at the Circuits and Systems
group of the Department of Electrical and Electronic Engineering, Imperial College London.
1.7 Publications
The original contributions of this thesis have been published in the following peer-reviewed
conference proceedings:
– ‘Semi-dense SLAM on an FPGA SoC’, Konstantinos Boikos and Christos-Savvas Bouga-
nis, 26th International Conference on Field Programmable Logic and Applications (FPL
2016), IEEE, 2016 [18]
– ‘A high-performance system-on-chip architecture for direct tracking for SLAM’, Kon-
stantinos Boikos and Christos-Savvas Bouganis, 27th International Conference on Field
Programmable Logic and Applications (FPL 2017), IEEE, 2017 [19]
– ‘A Scalable FPGA-based architecture for depth estimation in SLAM’, Konstantinos Boikos
and Christos-Savvas Bouganis, 15th International Symposium on Applied Reconfigurable
Computing (ARC 2019) [20]
Chapter 2
Background
2.1 A brief history of SLAM
The roots of SLAM can be traced back to the beginning of the 20th century with the first
analytical methods to use multiple views (photographs) of the same scene to recover some of
its geometry. This itself can be attributed to the first discoveries of lenses and optics centuries
ago, which led to the development of projective geometry, followed many years later by the
earliest efforts to define epipolar geometry.
This was formalised as the principles of photogrammetry, with one of the most important early
works published in 1899 and again in 1904 by Finsterwalder [21]. For these early works I am
going to refer to the translation published by Guillermo Gallego et al. [22] and the survey of
Sturm [23]. In that work, Finsterwalder demonstrates that from a set of uncalibrated images a
projective 3D reconstruction is possible and provides an algorithm utilizing seven points for the
case of two images. Kruppa published his work [24] in 1913, in which he first discusses the
earlier findings of Finsterwalder's paper “Die geometrischen Grundlagen der Photogrammetrie”
(The Geometric Foundations of Photogrammetry), where two views with known inner
orientation are sufficient to determine an “Object” up to scale. He then discusses a number
of improvements, including a proof that, with a calibrated set, 5 pairs of correspondences are
sufficient to provide a solution. According to Sturm [23], the concept of projective reconstruction
discussed in these earlier works was re-discovered in computer vision in the early 1990s.
Indeed, in these early papers one will discover a discussion quite similar to the one found in
the opening chapters of a leading modern book in computer vision published a century later,
“Multiple View Geometry” by Hartley and Zisserman, with slightly updated terminology.
In the late 1980s and 1990s, a crucial improvement in the field came with the introduction
of automatic feature detection to replace the manually hand-matched correspondences of the
early works. These include, among others, the Harris and Stephens corner and edge detector [25],
introduced in 1988, followed by various approaches such as the computationally efficient
FAST [26]. Later feature detectors focused on boosting accuracy.
SIFT by David Lowe [27] aimed to add scale invariance, and a descriptor that would allow
accurate matching in a global space of points where now features would be matched using
the similarity of their descriptors as a score. SURF [28] attempts to improve both accuracy
and performance and gained wide adoption in computer vision, but was still time-consuming to
compute and compare. Later works that followed were binary descriptors such as ORB [29] and
BRISK [30] that aimed to provide a comparable or better accuracy overall, while being much
faster to compute. Avoiding costly floating-point operations, while taking advantage of their
formulation to add capabilities such as rotation invariance, these led to much more efficient
SLAM and computer vision algorithms.
In addition to the method used to track features and recover the pose, various methods are
used to generate and maintain a map. Most early works used a version of Kalman filtering to
optimise over a set of uncertain observations from uncertain positions, jointly optimising all
parameters, e.g. Jin et al. [31] and later Davison et al. [32], recovering a pose estimate and the
map in one step. Meanwhile, the field saw the separation of tracking and mapping into separate
threads, and the separation of the optimisation problems, first proposed by Klein et al. with
Parallel Tracking and Mapping (PTAM), one version of which is published in [33]. Now large
scale maps could be optimised in terms of “Keyframes” as a background graph optimisation
problem, which led to improved large scale behaviour and performance taking advantage of
new multi-core hardware architectures.
Simultaneously, the first dense and direct methods were being developed, although they were
initially non-real time. Matthies et al. proposed in 1988 a method for the “Incremental Esti-
mation of Dense Depth Maps from Image Sequences”. They described an algorithm which uses
pixel-based information to estimate depth and depth uncertainty at each pixel and refine these
estimates over time, utilizing a set of pictures with a known relative translation. They compare
their method analytically and qualitatively with the point based approaches of the time with
very encouraging results, and establish a rigorous theoretical base. They were then followed
by the work of Hanna, who achieved “an estimation of both ego-motion and structure-from-
motion” directly from a sequence of images, minimizing photometric error, essentially building
the first example of direct SLAM. Despite these advances, these formulations were only considered
for real-time SLAM two decades later, when advances in hardware allowed them to be explored.
In the following section we will discuss the current solutions, and the principles of operation of
SLAM at the time of writing. The focus will be on the current state of the field of SLAM, the
kinds of sensors and algorithms used, as well as an overview of their respective advantages and
disadvantages for embedded operation in light of emerging applications.
2.2 Principles of state of the art SLAM
In the field of SLAM, different sensors have been used [34,35] including Lidar, sonar and recently
RGB-D cameras e.g. [36]. Lidar and sonar deal with a map usually limited to two dimensions
around the sensor, and were used in early works for their simplicity and effectiveness. They
are also used as complementary sensors, combined with visual sensors through sensor-fusion
algorithms, in applications which demand a high level of robustness and accuracy, such as
self-driving cars. However, they are heavier than a monocular camera, consume more power
and are mostly constrained to two dimensions, making them unsuitable for many applications,
especially as the sole sensor.
Active camera sensors in the RGB-D category combine a camera sensor with a technology
utilizing projected light to directly measure distance from the camera. Most recent approaches
recover depth by either projecting a grid light pattern in infrared and then using an infrared
camera to capture it, or the time-of-flight principle where a burst of infrared light is emitted
and then the time of reflection from a surface can be used to estimate depth. They both
utilize proprietary hardware and algorithms to directly send post-processed image and depth
information to the platform they are connected to. These have enabled high-quality dense
3D reconstruction in indoor spaces [37, 38] but are constrained in their area of operation to
indoor spaces of a few meters because of their design principles. They are also more expensive
and power-hungry than a simple visual sensor, making them less attractive for embedded low-
power robotics. As such, the work presented in this thesis focuses on enabling high quality
embedded SLAM using purely visual information from monocular cameras, as they work equally
well indoors and outdoors and in larger spaces. Moreover, cameras recently developed for
computer vision applications combine a global shutter and lower noise levels with the
traditional advantages of lighter weight and lower power requirements compared to other
sensors, making a passive visual sensor a very good fit for lightweight, power-constrained
platforms such as the ones we are targeting.
In the domain of visual SLAM, and especially direct, semi-dense SLAM, the rate of processing
needs to be fast enough so that successive frames have a relatively small amount of translation
between them [1]. The desired rate of update varies for different applications. Focusing on
autonomous robotics, which is one of the central motivations for this field and our work, research
has shown that effective localisation needs a performance of at least 30 frames per second for
most moving robotic platforms. For faster platforms, such as self-driving cars and quadcopters,
50 to 60 frames/s, or in some cases even higher performance, can be necessary for SLAM not to
fail and lose tracking under agile movement, as discussed in [39].
Faster moving platforms, such as fixed-wing aircraft, necessitate even faster sustained tracking
rates than that. In the work of A. Barry et al. [6] for example, stereo-based localisation and
mapping for obstacle avoidance on a fixed-wing drone updates the map with tracked points at a
rate of 120 frames per second. Augmented reality applications meanwhile, require a latency of
less than 16 milliseconds per frame from camera, to model update, to visualisation to provide
an acceptable experience. Even though the number most widely used to refer to performance
in this field is frames per second, or Hertz in terms of update frequency, the important figure
is the combined latency of the localisation and mapping tasks, since the processing happens
in real time. While the latency of the camera tracking is the minimum time between two
camera frames being tracked, in reality the combined latency of tracking and mapping added
is crucial, as it is this latency that dictates the moment in time that the next camera frame
can be processed using an updated model of the environment, to utilize the most amount of
information and provide the best result.
SLAM is a complex probabilistic problem where visual information consisting of hundreds of
thousands of pixels needs to be processed for every frame. The first solutions focused on
tracking a small set of a few hundred to a thousand points, however state-of-the-art methods
are moving to complete surface reconstruction and dense point clouds where most of the visible
points need to be processed for every frame. The camera resolutions used are normally in the
region of 640x480, sometimes going up to 960x720. It was found that increasing the resolution
up to that level provides a small benefit to feature-based algorithms, and has a smaller positive
effect on direct algorithms [40]¹, while the runtime usually increases at least linearly with the
number of pixels. All this results in high computational requirements. State of the art SLAM
needs at least high-end multicore CPUs to run in real-time for feature-based and semi-dense
methods [17, 41] and often high performance GPUs for acceleration as for example in [42],
and [36] targeting a processing speed of at least 20-30 frames per second. McCormac et al.
in 2018 [38] showed impressive results in environment reconstruction and semantics but they
report a performance level of 4 to 8 frames per second for “unoptimised code” running on a
combination of high-end graphics card and multi-core CPU.
One other distinction in recent works is the actual method of tracking, and specifically the level
of visual feature at which SLAM operates. Within the continuum of sparse to dense SLAM
(which can be used to describe both the map density, as well as how much of it is used for the
task of tracking), SLAM is also categorised as direct and indirect (or feature-based). Indirect
¹The effect varies on a case-by-case basis for different scenes. In the cited paper the accumulated error over many runs in different datasets is used to demonstrate the effect.
SLAM first scans incoming camera frames to generate a list of features and their descriptors.
In a second step these are then matched with past observations to generate a list of matches,
including filtering steps to remove outliers. Then the optimisation step is performed on that
set of matches, optimising the geometry of the points in a joint-optimisation with the camera
position. Finally, a consistent probabilistic framework, proposed in works such as MonoSLAM
by Davison et al. [32] can significantly reduce error accumulation and drift in the trajectory
estimation and improve robustness of the algorithm. This probabilistic tracking of high-quality
features continues to give promising results in camera tracking. The main drawbacks of these
methods are the expensive step of extracting high quality features and matching them across
frames, the computational complexity of the probabilistic fusion of this information and finally
the fact that the local map they produce is inherently a sparse selection of points, and requires
further processing to result in a dense reconstruction of the environment.
On the other hand, Direct SLAM uses the pixel intensity information in the camera frame to
optimise the pose, based on the photometric error of aligning the images from two poses, and a
photometric gradient to guide optimisation. Such an approach was first proposed in the 1990s,
but the computational capability to run it in real time was lacking until recently. Current state-of-
the-art works include Stuhmer et al. [43] and DTAM by Newcombe et al. [42]. The idea is to
directly utilize all the pixels visible in the camera and attempt to track a dense model of the
environment between frames in the form of surfaces. This is computationally complex; GPU
acceleration was leveraged in Newcombe’s work to achieve real-time performance defined by the
authors as “in the region of 30 frames per second”. According to the authors, this work showed
that “... a dense model permits superior tracking performance under rapid motion compared
to a state of the art method using features” when compared with the feature-based methods
available at the time. The authors also demonstrated the usefulness of the dense model for
interacting with the environment with a physics-enhanced augmented reality application. This
work inspired later state-of-the-art dense methods mentioned here in this thesis such as [36].
One main issue with discussing and comparing work in the field is that many performance
statements are written in vague language such as that quoted above. In this thesis, the
performance and complexity of a solution is described in as accurate terms as possible, but
in cases where accurate statements are not included in the published works we can only offer
educated guesses as to what the authors are referring to.
The idea of directly using photometric information for tracking and mapping was developed
towards a different direction by Engel et al. in his work in LSD-SLAM [17] and Direct sparse
odometry (DSO) [44]. In LSD-SLAM, he showed that one can still attempt to create a model
of the environment directly from photometric information to achieve robustness and accuracy,
as well as a denser map. However, by ignoring the parts of the frame that are poor in infor-
mation such as empty space and flat textureless surfaces the runtime is reduced significantly,
allowing the algorithm to be more lightweight and run in real time on a multicore CPU. It
is also important to note that such surfaces cannot have depth recovered by matching visual
information anyway, since there are no distinctive patterns to match, and are one of the known
failure modes of visual disparity estimation. This was demonstrated to run with a performance
of 30 frames per second for tracking and 15 frames per second for mapping on a laptop CPU
in the first formulation of the algorithm in [1] and is mentioned as running “in real-time on a
CPU” in [17].
Again, the problem with such statements from the authors is that they are vague and do not
offer a good estimate of how powerful the machine actually is. However, one version of that
paper gives a hint that an “Intel i7 quad-core CPU” was included in the laptop. Given the
time of publication, that refers to an Ivy Bridge or Haswell architecture Intel CPU, and the
only quad-core laptop models were 45-47 W TDP chips, offering 4-6 megabytes of cache and,
depending on the scenario, approximately 20-40% less performance than their desktop counterparts.
In addition, regarding image resolutions, this work, like most of the works we have cited, uses
VGA resolution (640x480 pixels) for its quoted performance figures; unless expressly stated
otherwise, the reader should assume this resolution for the works that follow.
Engel et al. took a different approach with their following work, DSO [44]. By constructing a
sparse sampling of points across the whole image, including low texture regions, and avoiding
a smoothness prior that makes a consistent probabilistic model infeasible to calculate in real
time, the authors developed a joint optimisation approach for all model parameters that runs
in real-time. In this approach, different modelled parameters including camera poses, camera
intrinsics (exposure, vignetting) and geometry parameters (inverse depth values) are jointly
optimised [44] in an effort to improve accuracy by explicitly modelling all known optical and
geometric variables. Optimisation is performed in a sliding window, where camera poses that
leave the field of view are marginalised, inspired by [45]. The information tracked is the pixel
intensity, and the map density can be tuned to trade off runtime with quality and resulting
density in different platforms.
In that work, they report superior performance to odometry methods using hand-crafted or
learned features, adding merit to the idea of directly utilizing the visible information and an
inverse depth formulation. Finally, in that work the authors demonstrate the different sensitivity
of feature-based and direct SLAM to different sources of error. According to [44]: strong
geometric distortions caused by rolling shutter and imprecise lenses have favoured the indirect
approach, while newly developed cameras specialised for computer vision with a high framerate
and a global shutter can lead direct formulations such as DSO to achieve superior accuracy on
well-calibrated data since they directly model the photometric error.
A final important distinction is the difference between a full SLAM system and a visual odome-
try algorithm. Visual odometry focuses on maintaining an accurate position estimate and uses
the simplest most efficient form of map possible. Hence, it was used in early robotics to improve
the state of the art towards navigating in unknown environments. On the other hand, moving
from navigating to interacting autonomously, and building more capable, environment-aware
systems, a more complete reconstruction and understanding of the environment becomes neces-
sary. Full SLAM methods attempt to recover as much of their environment as possible, as well
as keep a global, consistent map and enabling loop-closing. SLAM focuses on reconstruction
accuracy and completeness as well as accurate odometry, and the quality of a SLAM algorithm
has to be judged on both tasks at the same time, localisation and reconstruction accuracy as
well as its large scale consistency and map management.
The literature has only recently started discussing the quality of reconstruction and the
ability to efficiently re-visit places and handle global maps, and with the exception of some
notable works such as the benchmark from Handa et al. [46], McCormac et al. [38] and Whelan
et al. [36] most works only report metrics on trajectory errors to evaluate SLAM. In practice,
map quality is a crucial aspect for many applications and it significantly affects computational
requirements since it impacts the amount of information that is encoded and needs to be
processed. The community of SLAM has initiated discussions around this, and we expect this
to change in the future, but at the time of writing most published works discuss “tracking
accuracy” or “tracking performance” as a quantitative metric, while mapping, reconstruction
and other capabilities, even when discussed, are mostly treated at a qualitative level.
2.3 Direct semi-dense SLAM
One of the best performing and most well-known SLAM systems currently is LSD-SLAM,
which was published by Engel et al. in 2014 [17]. In comparison to other state-of-the-art
methods such as ORB-SLAM [41], it is the only one to achieve robust and efficient semi-
dense tracking, simultaneously with semi-dense map reconstruction. A high-density map is
an important feature for robotics, and skipping the step of feature extraction and matching
means operating well in a variety of environments and scales without relying on specific types
of features to work well.
A semi-dense SLAM algorithm generates efficient but high-density maps and can function
both indoors and outdoors relying on lightweight and power-efficient passive monocular camera
sensors. It stands to offer a good middle ground between dense methods requiring high-end
GPUs and custom RGB-D camera sensors and the lightweight point-based SLAM and visual
odometry algorithms that have so far been the state of the art for embedded systems. For these
reasons, and the accuracy of LSD-SLAM compared to other state-of-the-art methods, it was
selected as the best candidate to accelerate with the work presented in this thesis. As such, the
rest of the background will focus on semi-dense SLAM and the principles behind LSD-SLAM
to the extent that they are relevant to this thesis.
The rest of this section is based on the original work by Engel et al. [1, 17] and the discussion
included in his PhD thesis. A good overview of the details relevant to accelerating and
implementing the discussed algorithms in hardware will be provided here, focusing on factors
that have an impact on custom hardware design and information that is relevant to under-
standing the functionality of these tasks. For a more detailed and complete discussion on these
algorithms and their software implementation the reader should refer to the cited work.
SLAM is usually discussed in terms of tracking and mapping, however state-of-the-art methods
can be composed of many more components. These include among others a loop-closing thread
scanning for candidates in older Keyframes when revisiting locations and a background task
performing graph optimisation whenever a loop closure is successful to minimise the overall
error of a trajectory. However, the core components remain the first two. Firstly, they are
the ones that are latency sensitive when discussing real-time SLAM. If loop closure is delayed,
the local accuracy of SLAM should not be strongly impacted, however a skipped frame during
tracking will adversely affect all components and in rapid movement may lead to tracking loss.
Moreover, if these two tasks function well they immediately lead to accurate visual odometry
and a local (and in the case of LSD-SLAM semi-dense) reconstruction of the environment, while
the resulting Keyframes can be stored for later processing to recover a global reconstruction
at a later time or offline. Finally, as will be discussed in profiling results presented in the next
chapter, they also occupy most of the computation time. These reasons drove the decision
to focus on these two tasks with regards to hardware acceleration and base the design on a
heterogeneous platform with a mobile CPU executing everything needed to support large-scale
SLAM as software and custom accelerators targeting tracking and mapping.
2.4 Algorithmic overview of LSD-SLAM
The two main tasks of SLAM, as we have seen and will discuss in detail in the following sections,
are tracking and mapping. The main inputs and outputs of these tasks will be discussed in this
section, as well as the main functions they are composed of. The following data structures are
used in LSD-SLAM for tracking and mapping:
Camera Frame
The Frame data structure is used to contain each captured image from the camera, along with
associated metadata. It will initially be used during tracking, to estimate its pose in relation to
a Keyframe, and then will also be used for mapping to update a Keyframe’s depth observations,
using the pose estimate produced from the tracking task. Each Frame contains the following:
– Dimensions: Width and height in terms of number of pixels.
– Timestamp: The time of capture from the camera.
– Frame Id: Integer to efficiently identify the ordering of captured frames
– Intrinsic camera Matrix K: One static copy associated with all frames, it describes
all the intrinsic imaging and optical parameters of the camera and lens combination.
– Image: Pixel array, the actual image captured from the camera.
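The Frame structure above can be sketched as a simple container. This is an illustrative Python sketch, not the actual LSD-SLAM C++ definitions; the example intrinsic values are typical numbers for a VGA sensor, not taken from the thesis.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """Container for one captured camera image (illustrative sketch)."""
    width: int        # image width in pixels
    height: int       # image height in pixels
    timestamp: float  # time of capture from the camera
    frame_id: int     # integer identifying the ordering of captured frames
    image: list       # pixel array, row-major, len == width * height

# The intrinsic camera matrix K is static: one copy is associated with all
# frames. The values below are hypothetical, typical for a VGA sensor.
K = {"fx": 525.0, "fy": 525.0, "cx": 319.5, "cy": 239.5}
```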
Keyframe
A Keyframe represents a portion of the generated map. It is essentially a camera frame and
a collection of depth estimates for a subset of its pixels. Each depth estimate is in terms
of distance from the plane of projection of the camera, and is one dimensional. However a
Keyframe also contains a camera to world pose for its frame, allowing the depth of a Keypoint
to fully represent the 3D position of observed parts of the world. Each Keyframe contains the
following:
– Pose: Camera to world pose estimate for the Keyframe’s camera frame.
– Frame: A camera frame that was selected as a Keyframe candidate and its associated
metadata.
– Keypoint Array: Array has same dimensions as Frame (width x height). Each Keypoint
is a potential depth observation, and is composed of the following variables:
– Valid bit (Is there a valid observation for this pixel)
– Blacklisted (Has it been blacklisted after multiple failures to match)
– Validity Score (Used to keep track of successful or failed attempts to match)
– Inverse Depth (Depth estimate after a map update)
– Inverse Depth Variance (Estimated variance for above depth)
– Smoothed Inverse Depth (Smoothed version of inverse depth)
– Smoothed Inverse Depth Variance (Smoothed version of variance estimate)
– Max gradient array: Same dimensions as Frame. The maximum gradient in a region
around every pixel. A precise definition of its calculation is included in Chapter 3.
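The Keyframe and its per-pixel Keypoints can be sketched similarly. Field names are illustrative, and the initial variance value is an arbitrary “uninitialised” placeholder, not a constant from the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Keypoint:
    """One potential depth observation (illustrative field names)."""
    valid: bool = False           # is there a valid observation for this pixel
    blacklisted: bool = False     # blacklisted after repeated failed matches
    validity_score: int = 0       # tracks successful/failed match attempts
    idepth: float = 0.0           # inverse depth estimate after a map update
    idepth_var: float = 1e6       # estimated variance of the inverse depth
    idepth_smoothed: float = 0.0  # smoothed inverse depth
    idepth_var_smoothed: float = 1e6  # smoothed variance estimate

class Keyframe:
    """A camera frame promoted to a map reference (sketch)."""
    def __init__(self, width, height, pose):
        self.pose = pose  # camera-to-world pose estimate for this frame
        n = width * height
        # One Keypoint slot and one max-gradient value per pixel.
        self.keypoints = [Keypoint() for _ in range(n)]
        self.max_gradients = [0.0] * n
```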
Algorithm
The main tasks and the control flow between tracking and mapping are presented in Fig. 2.1.
There are other secondary and background functions involved in off-the-shelf SLAM but they
are either not relevant to this thesis or they do not significantly affect performance so they will
not be discussed in this section.
The main input is a series of images from a camera that are immediately converted to a Frame
data structure as they move through the algorithm. The other main data structure is the
Keyframe, storing the current estimate of the local map. These two main data structures are
shared and need to be synchronised continuously between the two tasks. In the figure they are
indicated with orange coloured boxes, while the light blue boxes represent the main functions
necessary to perform tracking and mapping.
The Track Frame function uses the tracking reference, composed of a Keyframe and its max
gradients array, along with the previously tracked Frame’s pose estimate, to estimate the pose
of the new camera Frame. The tracked Frames are then placed in a queue, depicted with orange
colour in Fig. 2.1, to be used to refine the Keyframe’s depth estimates for each Keypoint and
generate new observations, stored as new Keypoints. In case of a large performance difference
Figure 2.1: SLAM algorithmic overview. The main tasks are inside the two dashed rectangles. The main data structures are indicated with orange coloured boxes, while the light blue boxes represent the main tasks necessary to perform tracking and mapping. With grey we indicate some of the background tasks involved in full SLAM.
between mapping and tracking, the oldest tracked Frames can be dropped but this can result
in a reduction in the quality of the map and the robustness of the algorithm. Once the “close-
ness score” between Keyframe and currently tracked Frame goes over a threshold, the depth
estimates are transferred to a new Keyframe initialized from a recently tracked Frame. A key
characteristic of the control flow of the algorithm is the interdependency between the two tasks
of tracking and mapping and the requirement for a high communication bandwidth between
them.
Computational cost
The dependent variables in the algorithm are mainly the scene complexity (how many high-
texture areas are visible in the camera’s image) and the quality of the camera and lens apparatus
which can influence the sharpness, contrast and optical and electronic noise present in the digital
image. Noise and lack of focus are undesirable and will always reduce the quality and robustness
of the algorithm.
Higher scene complexity means:
• More points to map and track (for the successfully mapped ones), a denser map
• More amount of useful information, and the potential for a better accuracy
• Higher computational cost, since there is more data to process for each tracking and
mapping update.
In Section 1.2.4 we discussed independent variables in SLAM from a theoretical standpoint.
In practice, in LSD-SLAM the below are the main tunable independent variables that affect
quality and complexity.
Frame Resolution - Tracking and Mapping:
This can be the camera’s resolution, or lower by subsampling the original image. Same applies
to the Frame used in mapping. Most often in LSD-SLAM tracking uses a subsampled version
of the Frame used in mapping, where the dimensions are halved for performance reasons.
Pyramid levels during tracking:
Starting from the captured Frame, the image is subsampled by a factor of 4 (each dimension
divided by 2) a number of times, generating a sequence of image levels with a progressively lower
resolution. Tracking is then performed first at a very coarse resolution, and after convergence
repeats at the next finer level, using the pose estimate of the previous level as a starting point.
This process significantly improves the convergence radius and robustness of the algorithm with
a relatively small increase in computation time.
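The pyramid construction described above can be sketched as follows, assuming even image dimensions and simple 2x2 block averaging; the helper names are illustrative, not LSD-SLAM's.

```python
def downsample(img, w, h):
    """Halve each dimension by averaging 2x2 pixel blocks.

    img is a row-major flat list of length w*h; w and h are assumed even.
    """
    w2, h2 = w // 2, h // 2
    out = []
    for y in range(h2):
        for x in range(w2):
            i = (2 * y) * w + 2 * x  # top-left pixel of the 2x2 block
            out.append((img[i] + img[i + 1] + img[i + w] + img[i + w + 1]) / 4.0)
    return out, w2, h2

def build_pyramid(img, w, h, levels):
    """Return [(img, w, h), ...] from finest (level 0) to coarsest."""
    pyramid = [(img, w, h)]
    for _ in range(levels - 1):
        img, w, h = downsample(img, w, h)
        pyramid.append((img, w, h))
    return pyramid
```

Tracking would then start at the coarsest level and refine the pose estimate level by level down to level 0.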
Maximum iterations threshold for Tracking:
Tracking is iterative, and lasts until the error or the convergence rate becomes very low, or a
maximum number of iterations is reached (to bound the worst-case computation latency). These
parameters are tunable and can affect worst-case runtime and accuracy.
All of the above will be discussed qualitatively and in-depth in Sections 2.5 and 2.6 as well as
Chapters 4 and 5. However at a high level their effect can be modelled as follows.
Tracking operates on a set of k mapped points stored as Keypoints in the Keyframe, where
0 ≤ k ≤ N and N = width × height is the resolution of the Frame used for mapping. If a
different resolution is used for tracking and mapping, a subsampled version of the Keyframe
needs to be created as well, where each subsampled Keypoint is assigned a weighted average of
the properties of the 4 Keypoints it replaces. Further subsampled versions of the new Frame and
the Keyframe are then created for each pyramid level used in tracking, each with a resolution
1/4 the resolution of the level above it.
For infinite pyramid levels, the number of pixels included in all pyramid levels subsampled from
a starting resolution $N$ would be $\left(\frac{1}{4} + \frac{1}{4^2} + \frac{1}{4^3} + \dots\right)N$. As that is a geometric series, we know
that:
$$\sum_{n=1}^{\infty} \frac{1}{4^n} = \frac{1}{3}$$
Therefore the amount of pixels to process is always less than or equal to:
$$N + \frac{1}{3}N = \frac{4}{3}N$$
for tracking and N for mapping. Though one iteration of tracking is less computationally
intensive than a complete update of the Keyframe, it is repeated multiple times for each level.
However, the upper bound of that is known, and tunable. The maximum iterations per level
are progressively fewer for lower subsampling levels, from multiple tens of attempts for coarser
levels to less than 10 for the finer and most computationally expensive one. Therefore, the
worst-case latency for tracking is asymptotically bounded by a constant multiple of $\frac{4}{3}N$.
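The bound above can be checked numerically; this is a small illustrative sketch, not part of the algorithm itself.

```python
def pyramid_pixels(n, levels):
    """Total pixel count summed over `levels` pyramid levels, starting from
    a finest level of n pixels; each coarser level has a quarter as many."""
    total, cur = 0, n
    for _ in range(levels):
        total += cur
        cur //= 4
    return total

n = 640 * 480
for levels in (1, 3, 5, 8):
    # The total never exceeds the geometric-series limit of 4N/3.
    assert pyramid_pixels(n, levels) <= 4 * n / 3
```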
Gradient threshold:
Pixel selection depends on the maximum gradient in an immediate neighbourhood, which is
affected from scene complexity, but the gradient threshold used to perform the selection is a
tunable independent parameter.
By tweaking the gradient threshold, and depending on the scene complexity, we can tweak
the subset of pixels that we are processing for tracking and mapping. For typical scenes and
thresholds the number of valid Keypoints is k ≈ 0.20N . Mapping has to process all N pixels
to generate the k valid observations. As we shall discuss in the following sections, the cost per
Keypoint varies, and a non-valid Keypoint will require even less processing, but the maximum
amount of processing steps is bounded.
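A simplified sketch of this selection step is shown below, using plain central differences rather than the precise max-gradient-in-a-neighbourhood definition deferred to Chapter 3; the function name and threshold are illustrative.

```python
def select_keypoint_candidates(img, w, h, threshold):
    """Return indices of interior pixels whose gradient magnitude exceeds
    `threshold`. img is a row-major flat list of pixel intensities."""
    selected = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            i = y * w + x
            gx = (img[i + 1] - img[i - 1]) / 2.0  # horizontal central difference
            gy = (img[i + w] - img[i - w]) / 2.0  # vertical central difference
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                selected.append(i)
    return selected
```

On a textureless (flat) image nothing is selected, which mirrors the behaviour on typical scenes: only the information-rich subset of roughly 0.20N pixels survives the threshold.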
Therefore, although the precise execution time is tunable through the parameters above, both
tracking and mapping have a complexity of O(n), where n is the number of pixels in the camera's
Frame. There are other computations associated with executing and linking these two tasks,
but they are all either constant time with a complexity of O(1) or are equivalent to N multiplied
by a constant and therefore Θ(n). As such, the broader computational cost of LSD-SLAM is
in the order of O(n). A more precise theoretical discussion of the different steps involved in
the algorithm will be introduced in the next two sections. Then the tracking and mapping
functions will be described in more precise detail in Chapters 3 to 5, along with their hardware
implementation.
2.5 Tracking in LSD-SLAM
The next two sections will introduce the key ideas behind direct semi-dense tracking and semi-
dense depth map estimation, based on the works of J. Engel et al: “Semi-dense visual odometry
for a monocular camera” [1] and “LSD-SLAM” [17]. We will discuss the basics of how they
were implemented in software and the characteristics that turned out to be important during
the research work presented in this thesis.
The basis of LSD-SLAM is semi-dense photometric tracking, closely coupled with a semi-dense
filtering depth estimation algorithm. The aim of tracking is to recover the camera pose with
respect to the world for each camera frame. LSD-SLAM projects the observed 3D points of the
world from the generated map to the current camera frame to optimise directly on the pixel
intensity differences in a process called direct whole-image alignment. This is expressed as a
variance-normalised weighted least squares minimization of the photometric error. For tracking,
only the information-rich points in the camera’s view are used based on the photometric gradient
value around a point. After the optimisation process, a pose estimate is generated for the
camera, describing its position and orientation in 3D space. The algorithm then uses that
pose estimate to recover the depth of observed points, according to the principles of multi-
view geometry [47], while existing depth observations are updated as a Gaussian probability
distribution with their associated mean and variance.
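At its core, the variance-normalised weighted least-squares objective sums squared photometric residuals weighted by their inverse variance. The sketch below assumes the projection/warp step has already produced matched intensity pairs; it is a hypothetical miniature, not the LSD-SLAM implementation.

```python
def photometric_cost(ref_intensities, cur_intensities, residual_vars):
    """Variance-normalised sum of squared photometric residuals.

    Each residual r_i = I_ref(p_i) - I_cur(warp(p_i)) is weighted by the
    inverse of its estimated variance, so uncertain points contribute less.
    """
    cost = 0.0
    for i_ref, i_cur, var in zip(ref_intensities, cur_intensities, residual_vars):
        r = i_ref - i_cur
        cost += (r * r) / var
    return cost
```

Minimising this cost over the pose parameters is what the iterative optimisation in tracking performs.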
A key idea is a depth map, propagated from frame to frame and continuously used for tracking
while being updated with every newly tracked frame using two-view stereo. The depth map is
initialised from a camera frame, matching and storing a maximum of one depth estimate per
camera pixel location. Each of these estimates is stored in a data structure, along with the
frame it was initialised on, as a Gaussian probability distribution with its variance estimate.
From this point forward, the data structure containing all of the above information with the
associated camera frame will be called a Keyframe, and the valid depth estimates inside it are
also referred to as Keypoints. After a sufficient camera displacement, when a distance/disparity
heuristic between the current frame and the Keyframe is satisfied, a new Keyframe is generated
from the new point of view. Then, the previous one is retired and is incorporated in a graph
of Keyframes, which for LSD-SLAM represents the global map. Local and global mapping will
be described more in the next section.
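Updating a depth estimate modelled as a Gaussian with a new observation reduces to inverse-variance weighted fusion, in Kalman-filter fashion. A minimal sketch follows (the actual LSD-SLAM update also handles outliers, occlusion and smoothing):

```python
def fuse_gaussian(mu_prior, var_prior, mu_obs, var_obs):
    """Fuse a prior inverse-depth estimate with a new stereo observation,
    both modelled as 1-D Gaussians. Returns the posterior (mean, variance).
    """
    denom = var_prior + var_obs
    mu_new = (mu_prior * var_obs + mu_obs * var_prior) / denom
    var_new = (var_prior * var_obs) / denom  # always shrinks the variance
    return mu_new, var_new
```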
In Fig. 2.2 we can see a visual representation of the depth and variance values collected in
the Keyframe and the calculated error (residual) from projecting these 3D points from the
Keyframe’s point of view to that of the camera. In Fig. 2.2(a) we can see the two camera images
used during tracking and mapping. In Fig. 2.2(b) and (c) we can see the estimated inverse depth
map and its associated variance for these two views, where the colour transitioning from red to
blue indicates the magnitude of the recovered depth, and variance, with red being closer, or a
smaller variance. In Fig. 2.2(d) and (e) we see a visualisation of the residuals on the left and
their variance on the right when tracking (d) and mapping (e) with lighter colour indicating
a larger value. Finally, Fig. 2.2(f) visualises the normalising weights for the two views, used
during tracking optimisation.
Notation: In this section, matrices are denoted as bold, capital letters (R) and vectors as bold
lower case letters (t).
36 Chapter 2. Background
Figure 2.2: Figure adapted from [1] to demonstrate tracking with direct Keyframe alignment on sim(3) utilizing the estimated depth map. We can see from two camera views the current state of the map on the left, as a collection of inverse depth and depth variance values for the mapped points, and the photometric residual on the top right as a result of the reprojection.
The camera poses are initially represented as 3D rigid body transformations as in Eq. 2.1, while the pose-to-pose constraints between Keyframes are represented with 3D similarity transformations, which have an additional scale factor s ∈ ℝ as in Eq. 2.2:

$$\mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in SE(3) \quad (2.1)$$

$$\mathbf{T} = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in Sim(3) \quad (2.2)$$
LSD-SLAM uses pixels characterised by a gradient larger than a set threshold to perform tracking and mapping. It operates under the assumption that the most useful pixel areas are those that contain a sufficiently high intensity gradient, and therefore some kind of texture
or edge. During tracking, LSD-SLAM does not require the intermediate and computationally
demanding step of detecting and comparing features to find unique matches, and at the same
time manages to create a more dense reconstruction of the environment, offering a good middle
ground between sparse and dense algorithms.
During optimisation, LSD-SLAM uses the vector ξ ∈ se(3) of the associated Lie algebra as a minimal representation of the camera pose, instead of the SE(3) matrix notation, as it allows the optimisation to act on the pose directly. Lie theory offers a way to apply mathematical optimisation methods to the poses obtained: a pose is an element of a Lie group, a differentiable manifold that is locally Euclidean, and pose increments can be expressed minimally in its associated Lie algebra. After the first estimation, each pose is processed and updated by an iterative optimisation process, the Levenberg-Marquardt algorithm [48]. Poses can be converted back to SE(3) with the exponential map G = exp_{se(3)}(ξ).
The pose is recovered by starting from a pose estimate T and aligning the frame that is tracked
with the current Keyframe, to minimise a photometric error. The inputs of the tracking task
are the initial pose estimate, the camera frame to be tracked and the semi-dense depth map
stored in the Keyframe. The optimisation is performed over the variance-normalised sum of the squared intensity differences, as in the following equation:
$$E_p(\xi_{ji}) = \sum_{p \in \Omega_{D_i}} \left\| \frac{r_p^2(p, \xi_{ji})}{\sigma_{r_p}^2(p, \xi_{ji})} \right\|_\delta \quad (2.3)$$
in which r is the residual, calculated for the subset of pixels p in the Keyframe which contain
a valid depth value Di. Its value is the photometric disparity of projecting a Keypoint p to
the frame captured at the camera’s current viewpoint based on the current pose estimate, as in
Eq. 2.4. The intensity estimate I_j is obtained through an interpolation of the values of the 4 pixels surrounding the projected point on the camera frame. In Eq. 2.4 this sub-pixel interpolation is symbolised with ω.
$$r_p(p, \xi_{ji}) := I_i(p) - I_j(\omega(p, D_i(p), \xi_{ji})) \quad (2.4)$$
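The sub-pixel interpolation ω and the residual of Eq. 2.4 can be sketched as follows. The bilinear weighting over the 4 surrounding pixels follows the text; the function names and the row-major pixel indexing convention are our own.

```python
def bilinear(img, x, y):
    """Interpolate intensity at sub-pixel (x, y) from the 4 surrounding pixels."""
    x0, y0 = int(x), int(y)
    ax, ay = x - x0, y - y0  # fractional offsets inside the 2x2 patch
    return ((1 - ax) * (1 - ay) * img[y0][x0]
            + ax * (1 - ay) * img[y0][x0 + 1]
            + (1 - ax) * ay * img[y0 + 1][x0]
            + ax * ay * img[y0 + 1][x0 + 1])

def residual(I_i, I_j, p, projected):
    """Photometric residual as in Eq. 2.4: reference intensity minus the
    interpolated intensity at the projected (sub-pixel) location."""
    (u, v), (x, y) = p, projected
    return I_i[v][u] - bilinear(I_j, x, y)

img = [[0.0, 1.0], [2.0, 3.0]]
assert bilinear(img, 0.5, 0.5) == 1.5  # average of the 2x2 patch
```

In the actual pipeline the projected location comes from warping the Keypoint with the current pose estimate; here it is simply passed in as a coordinate pair.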
In Eq. 2.3, σ²_{r_p} is the residual’s variance, computed using covariance propagation and utilizing
the inverse depth variance Vi stored in the Keyframe for each point under a Gaussian noise
assumption:
$$\sigma_{r_p}^2(p, \xi_{ji}) = 2\sigma_I^2 + \left( \frac{\partial r_p(p, \xi_{ji})}{\partial D_i(p)} \right)^2 V_i(p) \quad (2.5)$$
The operator ‖·‖_δ is the Huber norm, calculated as in Eq. 2.6, a normalising factor that reduces the effect of larger residuals on the optimisation process, improving the accuracy and robustness of the algorithm in the presence of outliers.
$$\|r^2\|_\delta := \begin{cases} \dfrac{r^2}{2\delta} & \text{if } |r| \le \delta \\[4pt] |r| - \dfrac{\delta}{2} & \text{otherwise} \end{cases} \quad (2.6)$$
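Eq. 2.6 translates directly into code; the sketch below takes the squared residual r² and the threshold δ as inputs, and is continuous at |r| = δ by construction.

```python
def huber(r_sq, delta):
    """Huber norm ||r^2||_delta of Eq. 2.6, applied to a squared residual."""
    r = r_sq ** 0.5
    if r <= delta:
        return r_sq / (2.0 * delta)  # quadratic region for small residuals
    return r - delta / 2.0           # linear region damps outliers

assert huber(1.0, 2.0) == 0.25   # quadratic branch: 1 / (2 * 2)
assert huber(16.0, 2.0) == 3.0   # linear branch: 4 - 2/2
```

The linear branch grows with |r| rather than r², which is what limits the influence of outlier residuals on the normal equations.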
Finally, the sum is minimised using the Levenberg-Marquardt method, which performs a damped Gauss-Newton optimisation. Gauss-Newton is based on a local Taylor approximation of the error function, improving how rapidly and accurately the optimisation converges to a minimum compared to gradient descent. Levenberg-Marquardt converges faster by adding a positive multiple of the identity matrix to the Hessian, H(x) + λI, essentially interpolating, depending on the value of λ, between traditional Gauss-Newton and gradient descent; it is thus better able to deal with non-linear functions, such as the photometric error used here, than traditional Gauss-Newton implementations [48].
In this implementation, the Hessian is substituted with the approximation JᵀJ, so the step is formulated as:

$$\delta\xi^{(n)} = -(J^\top J + \lambda I)^{-1} J^\top r(\xi^{(n)}) \quad (2.7)$$
where ξ ∈ se(3) and n is the optimisation step, with:
$$J = \left. \frac{\partial r(\epsilon\, \xi^{(n)})}{\partial \epsilon} \right|_{\epsilon=0} \quad (2.8)$$
Additionally, LSD-SLAM implements a weighting scheme: in each iteration a weight matrix W = W(ξ^{(n)}) is generated depending on the residual and the depth certainty (from the variance estimate). With the residual in the iteratively solved error function multiplied by these weight factors, the error function becomes:
$$E_p(\xi_{ji}) = \sum_{p \in \Omega_{D_i}} \left\| \frac{w(p, \xi_{ji})\, r_p^2(p, \xi_{ji})}{\sigma_{r_p}^2(p, \xi_{ji})} \right\|_\delta \quad (2.9)$$
Thus the update is computed by solving:

$$\delta\xi^{(n)} = -(J^\top W J + \lambda I)^{-1} J^\top W r(\xi^{(n)}) \quad (2.10)$$
The goal of the optimisation process is to estimate an update δξ(n) that converges towards
a minimum of the error function as quickly and accurately as possible. The first step is to calculate the residual and weights and perform one iteration of the optimisation process. After this step, the generated linear system is solved for different values of λ: the residual and weights are recalculated, and a tentative update step is generated and applied to produce a new pose.
If the new pose decreases the error then the pose update is applied as follows:
$$\xi^{(n+1)} = \log_{SE(3)}\!\left( \exp_{se(3)}(\delta\xi^{(n)}) \cdot \exp_{se(3)}(\xi^{(n)}) \right) \quad (2.11)$$
and the Hessian and Jacobian are recalculated from the new position. Otherwise the update is rejected, and the system is solved again for a new λ. This process is stopped if a maximum
number of iterations is reached, if the error decreases by an amount smaller than a threshold
or if the update step is smaller than a set threshold, in an effort to improve convergence and
reduce the number of iterations for improved computational efficiency, especially in the light of
pyramid processing.
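The accept/reject mechanics described above can be illustrated on a deliberately simplified one-parameter least-squares problem (fitting y = a·x), where JᵀJ reduces to a scalar. This is a sketch of the Levenberg-Marquardt loop only, not of LSD-SLAM's 6-DoF pose optimisation; the damping schedule (halve λ on acceptance, multiply by 10 on rejection) is a common convention, not necessarily the one used in LSD-SLAM.

```python
def lm_fit(xs, ys, a=0.0, lam=1e-3, iters=20):
    """Levenberg-Marquardt sketch for the 1-parameter model y = a*x.
    Damped step: delta = -(J^T J + lambda)^-1 J^T r, as in Eq. 2.7."""
    def error(a):
        return sum((a * x - y) ** 2 for x, y in zip(xs, ys))

    for _ in range(iters):
        r = [a * x - y for x, y in zip(xs, ys)]          # residuals
        JtJ = sum(x * x for x in xs)                     # scalar J^T J
        Jtr = sum(x * ri for x, ri in zip(xs, r))        # scalar J^T r
        delta = -Jtr / (JtJ + lam)                       # damped GN step
        if error(a + delta) < error(a):
            a += delta     # accept: move towards Gauss-Newton behaviour
            lam *= 0.5
        else:
            lam *= 10.0    # reject: fall back towards gradient descent
    return a

a = lm_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # exact data, true a = 2
```

With λ small the step is essentially Gauss-Newton; after a rejection λ grows, shrinking the step and rotating it towards the negative gradient, which is exactly the interpolation discussed in the text.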
The discussed optimisation process is applied at multiple resolution levels, in a pyramid representation at different levels of subsampling, from coarse to fine. This aids convergence, as
direct image alignment is inherently a non-convex optimisation problem. This coarse-to-fine
processing works by initially smoothing the search space and optimising there, so that at the
next, finer step, the search is initialised at an already good estimate. Starting sometimes at a resolution as low as 15x20, or more typically 30x40, this has proven to be a very good solution to increase the convergence radius and accuracy in a variety of scenarios [17].
Figure 2.3: Pyramid processing, from level 3 (1/8 resolution) up to level 0 (original resolution). Starting at a lower, coarser resolution increases the radius of convergence for the optimisation and improves the speed of reaching a minimum.
As tracking has to begin at a resolution many steps coarser than the current camera/map
resolution, for every frame tracked a generation of subsampled levels is required for the two
data structures: the current camera frame and the Keyframe. In the case of the camera frame,
this is a simple averaging over four pixels for successive division of the resolution by two in both
dimensions. To support the multi-level tracking process a subsampled version of the Keyframe
is also generated. Firstly, the pixel values of the Keyframe’s image are subsampled in the same
way as the current camera frame. Simultaneously, to assign depth and depth variance to these
coarser Keypoints, the algorithm combines the depth map observations (if valid) across the
four pixels at the next finer pyramid level, weighting them by their variance values to
give more importance to the most confident observation. If the resulting confidence is below
a threshold (or indeed all four constituent Keypoints are invalid) the Keypoint at this level is
simply marked as invalid as well.
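The two subsampling steps just described can be sketched as follows: a 2x2 averaging step for the intensity image, and an inverse-variance-weighted combination of up to four finer-level depth hypotheses. Treating invalid Keypoints as None and the confidence cut-off value are our illustrative choices, not LSD-SLAM's.

```python
def half_image(img):
    """One pyramid step: average each 2x2 block of the intensity image."""
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(len(img[0]) // 2)]
            for r in range(len(img) // 2)]

def fuse_depths(hypotheses, max_var=1.0):
    """Combine up to four (inverse_depth, variance) hypotheses from the
    finer level by inverse-variance weighting; None marks an invalid point.
    Returns None if all inputs are invalid or the result is too uncertain."""
    valid = [h for h in hypotheses if h is not None]
    if not valid:
        return None
    w = [1.0 / var for (_, var) in valid]              # confidence weights
    d = sum(wi * di for wi, (di, _) in zip(w, valid)) / sum(w)
    var = 1.0 / sum(w)
    return (d, var) if var <= max_var else None

assert half_image([[0, 2], [4, 6]]) == [[3.0]]
```

Weighting by inverse variance gives the most confident finer-level observation the largest influence on the coarser Keypoint, matching the behaviour described in the text.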
2.6 Mapping in LSD-SLAM
After the tracking task provides a pose estimate the first step for mapping is to decide which
points to map. In the literature of SLAM different methods have been used to judge point
quality for sparse or semi-dense SLAM, usually centring on the idea that changes in pixel
intensity indicate a possible feature location [29, 33]. In the semi-dense methods we discuss
here, the points to attempt to map are selected based on the maximum gradient of a pixel and its immediate neighbourhood being above a threshold. Pixels with a higher local gradient are considered good candidates to match, and this is refined after tracking with a “pixel usefulness” metric, updated when a valid map point is used successfully or unsuccessfully for
tracking. This is a natural choice since semi-dense tracking in LSD-SLAM attempts to directly
use image intensity gradients to optimise camera pose estimation. It should be noted at this
point that in these works there is also an underlying assumption of smoothness for the world
observed.
For each depth update, the aim of mapping is to perform an exhaustive search for a new
observation of each good quality point in the Keyframe along a line on the current frame.
This line is the epipolar line, and the search is conducted using the pixel’s intensity value.
Geometrically, if the relative position and orientation of the camera is known when two frames
were captured, it is proven that a point observed on one camera frame will always project to a
line on the plane of the other camera’s frame [47] called the epipolar line. This is demonstrated
in Fig. 2.4. Two camera frames will not always observe the same point, as the epipolar line may lie completely outside the frame that the sensor captures. As such, the search is restricted to the intersection of the line and the image frame. If a successful match is
found, this can then be used to calculate a new estimate for the depth value of that point.
Another idea utilized in semi-dense mapping in LSD-SLAM is that if there is a prior depth observation with sufficient confidence (judged by the estimated variance of the Gaussian hypothesis), the estimated variance is used to limit the search interval to d ± 2σ_d, where d and σ_d denote the mean and standard deviation of the prior hypothesis. At the end of the search for a good match, a sub-pixel localisation is performed at the match location in an attempt to reduce the error further, by interpolating between two neighbouring matches.

Figure 2.4: Epipolar geometry; the epipolar line is depicted in orange.
Finally, instead of scanning to match a single pixel, a squared-error function comparing 5 equidistant points is used to improve accuracy. This approach significantly increases robustness at only a small increase in complexity, since the points are placed and searched for along the same line, meaning successive values can be reused.
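A sketch of this matching step is given below for the special case where the epipolar line coincides with an image row (as under a purely horizontal camera translation). Here [lo, hi] stands for the prior-based search interval d ± 2σ_d already mapped to pixel coordinates, and the function names are ours.

```python
def epipolar_search(ref, cur, row, x_ref, lo, hi):
    """Search along one image row (a horizontal epipolar line) for the best
    match of the 5-point pattern centred on x_ref in the reference image.
    [lo, hi] is the prior search interval, clamped so the whole 5-point
    pattern stays inside the frame."""
    pattern = [ref[row][x_ref + k] for k in range(-2, 3)]
    lo = max(lo, 2)
    hi = min(hi, len(cur[row]) - 3)
    best_x, best_cost = None, float("inf")
    for x in range(lo, hi + 1):
        # Squared error over 5 equidistant points; in a streaming
        # implementation 4 of the 5 values can be reused between steps.
        cost = sum((pattern[k + 2] - cur[row][x + k]) ** 2 for k in range(-2, 3))
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost

ref = [[0, 1, 2, 9, 2, 1, 0, 0, 0, 0]]
cur = [[0, 0, 0, 1, 2, 9, 2, 1, 0, 0]]  # same intensity profile, shifted by +2
```

In the general case the 5 sample positions are spaced along an arbitrary epipolar line rather than a row, but the cost function and interval clamping are unchanged.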
After a successful disparity search along the epipolar line, and an optional step of subpixel
refinement, the depth of the point is calculated. In LSD-SLAM an inverse depth formulation is
used and stored for each point, and the authors have included both photometric and geometric
errors in the formulation of the uncertainty σ²_d. If this is the first successful observation of
a point, it is initialised with this observation as its depth estimate. Subsequent observations
are incorporated into previous ones by multiplying the two distributions (corresponding to the
update step in a Kalman filter) [1].
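The multiplication of two Gaussian hypotheses mentioned above reduces to a closed-form update of the mean and variance, sketched below with an illustrative function name.

```python
def fuse(d1, var1, d2, var2):
    """Product of two Gaussian depth hypotheses N(d1, var1) * N(d2, var2):
    the Kalman-filter-style update step referenced in the text."""
    d = (d1 * var2 + d2 * var1) / (var1 + var2)  # variance-weighted mean
    var = var1 * var2 / (var1 + var2)            # fused variance shrinks
    return d, var

assert fuse(1.0, 0.5, 2.0, 0.5) == (1.5, 0.25)
```

Note that the fused variance is always smaller than either input variance, so every successful observation strictly increases the confidence of the hypothesis.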
During the lifetime of the SLAM algorithm, after a sufficient distance or viewpoint change
from the pose of the current Keyframe a new Keyframe is instantiated from the latest camera
frame, to which the current depth map is propagated.

Figure 2.5: From [1]. The top row shows different camera frames overlaid with the estimated semi-dense inverse depth map; the bottom row shows the camera view as a pyramid with blue edges, with its associated trajectory as a line, in front of a 3D view of the tracked scene.

Using the two frames’ relative 6-DoF position, the inverse depth for each point is calculated for the new position with a small increase in uncertainty. Keeping the uncertainty increment small was found to be more effective at
reducing drift [1]. The point is then stored as a depth estimate to its closest corresponding
integer pixel location in the new frame. In the case of two points mapping to the same pixel, if they are statistically similar they are fused as two observations. Otherwise, the point with the larger depth is treated as occluded and deleted.
The final steps of depth mapping regularise the depth map after every update step. Two algorithms are at work here. The first adds observations in “holes” surrounded by successful neighbours, and removes outliers if a pixel or most of its surrounding neighbours have become invalid. The notion of valid and invalid here corresponds to a validity score stored in the Keyframe for each point, increased or decreased according to successful or unsuccessful observations during tracking and mapping. A point can also be invalidated directly in special cases, such as its gradient falling below a certain threshold in a new Keyframe due to changes in angle and distance.
The second filtering algorithm calculates a smoothed version of the depth map, stored separately. This smoothed version is used to initialise the search parameters for the epipolar line to the most probable centre and range, according to a smoothness prior for the world. It can also be used for visualisation, robotics and augmented reality applications. The exact (unsmoothed) value is used in the rest of the algorithm, including tracking, and is where the current Gaussian hypothesis remains stored.
Lastly, LSD-SLAM uses older Keyframes that have been retired to represent the global map. As mentioned previously, once a distance/disparity heuristic exceeds its threshold, a Keyframe is replaced by one closer to the current view of the camera, and the depth values it contained are propagated to the new one. Once a Keyframe has been retired it is incorporated in a global pose-to-pose graph. In that graph, each Keyframe is represented as a vertex, with 3D similarity transforms as edges, as in Eq. 2.2, adding scale information as a 7th degree of freedom. This scale factor is included to keep track of scale changes, since for every propagation of depth between Keyframes the map is rescaled to keep the average depth value close to 1.0.
Since in monocular methods absolute depth values are unrecoverable due to the scale ambiguity
problem, the depth map is always relative. Taking advantage of this, scaling the map to a value
of 1 when creating a new Keyframe improves the behaviour of the optimisation processes of
tracking and mapping, while simultaneously keeping track of scale changes between Keyframes.
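The rescaling step can be sketched as follows; normalising the mean of the stored inverse depths to 1.0 is our illustrative simplification of the behaviour described in the text, and the function name is our own.

```python
def normalise_scale(inv_depths):
    """Rescale inverse depths so their mean becomes 1.0, returning the scale
    factor so the pose-graph edge can record it (the 7th degree of freedom)."""
    s = sum(inv_depths) / len(inv_depths)
    return [d / s for d in inv_depths], s

scaled, s = normalise_scale([0.5, 1.5, 2.0])
```

Recording s on the Sim(3) edge is what lets the graph optimisation recover consistent relative scale between Keyframes despite each depth map being normalised independently.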
The algorithm then uses a graph optimisation method in the background to reduce tracking drift during loop closures, using only pose-to-pose constraints to increase computational efficiency. The graph optimisation framework used in LSD-SLAM is g2o, published by Grisetti et al. [49] and is
available as an open source library. This, combined with loop closure and the ability to revisit
and re-use old Keyframes improves LSD-SLAM’s large-scale capabilities and accuracy.
2.7 Proposed Architectures and FPGA-SoCs
The architectures proposed in this thesis were designed with a heterogeneous system-on-chip in
mind. The central idea is that different processors and custom hardware can provide comple-
mentary capabilities. Thus, a heterogeneous system-on-chip can be greater than the sum of its
parts. A mobile CPU can provide the flexibility and compatibility with off-the-shelf software
implementations and libraries to enhance the support for state-of-the-art SLAM. However, such
SLAM algorithms need a higher level of performance than any mobile CPU can provide. They are data-intensive and characterised by a large number of operations per data point and variable,
complex control-flow.
In contrast, specialised accelerator units, designed specifically to take advantage of the char-
acteristics of the algorithm, stand to offer significantly more compute performance in a more
power-efficient platform for the performance-critical parts of the application. When the accelerators are implemented on the same chip as the CPU, physical proximity minimises the latency and energy cost of communication between them, which enables cooperation at a finer granularity and in a more efficient manner. Simultaneously, operating in the same memory space makes this cooperation more efficient, with both sides having access to the same off-chip resources, such as an off-chip DRAM. For example, a single well-designed DRAM controller
will improve the efficiency of memory traffic from all sources. Moreover, the proximity of the CPU’s caches to the custom memories in the accelerator offers the opportunity for coherency protocols to be implemented, achieving low-latency synchronisation of shared data. A concept
of this architecture is depicted in Fig. 2.6.
Figure 2.6: Concept heterogeneous architecture block diagram: a mobile CPU with its cache subsystem, tracking and mapping accelerators with shared image caches, a camera input, and a common memory controller to off-chip DRAM.
Field Programmable Gate Arrays
However, as mentioned in the introduction, the cost of implementing these designs directly
as ASICs is very high. Furthermore, the application domain is still actively investigated and
changes arise often in the state of the art, meaning the long development cycles characteristic
of custom silicon cannot provide an immediate solution. Because of this, the designs presented
in this thesis were developed with an FPGA-SoC in mind that can offer most of the benefits
of custom hardware while reducing development and deployment costs and time significantly.
As a result, an off-the-shelf FPGA-SoC can be used directly by researchers in the field and
adapted for different needs, or included in industry projects. Once the architectures and the algorithms both mature, most of our architectural choices remain valid in the realm of fully-custom ASICs, and they can then be adapted to that domain to realise even higher performance and power efficiency.
Moreover, modern FPGAs now come with many hardened units for frequently used or expensive
operations. In Fig. 2.7 we can see the building blocks that make up a modern FPGA fabric.
Look-up tables that emulate digital gates are organised in logic blocks with extra capabilities
while dedicated SRAM memories form “columns” of Block RAM (BRAM) between them for
low-latency multi-port access. Dedicated DSP blocks, containing a multiplier, a double-precision accumulator, a pre-adder and other hard-coded operations (such as XOR and pattern detection) implemented in silicon, enhance performance and reduce the resources needed for many fixed- and floating-point math and binary operations. All of this is surrounded by fast reconfigurable
interconnect which improves routing significantly over earlier FPGAs. The combination of these
advances comes with a lower latency and resource cost than logic-only based FPGA fabrics of
the past, significantly improving the gap in performance and area discussed in the introduction.
These resources are organised in slices, represented as the smallest parallelograms in the figure
and then clock domains, surrounded by switch interconnect. Finally, at the border usually
there are dedicated I/O circuits and ports to connect to other resources on and off chip. This is
summarised in Fig. 2.8, presenting an example block representation of a modern FPGA fabric.
As such, the performance gap to ASICs comes to within an order of magnitude or less for some designs,
Figure 2.7: FPGA architecture: logic blocks, columns of BRAM and DSP blocks, I/O, and switch interconnect. Dedicated SRAM memories and DSP blocks with capable hardened multipliers have significantly improved the efficiency of frequently used and traditionally costly operations.
Figure 2.8: A modern FPGA fabric is organised in clock domains made up of slices, with the edges of the silicon usually housing communication circuits and ports.
allowing the use of architectures implemented in FPGA-SoCs as permanent accelerators in production-level devices. In this thesis, our work will be presented first as an architecture,
with some choices based on the FPGA-SoCs used for implementation but otherwise at a level
that can be generalised to other platforms. Then the evaluation of the architecture will be
presented, where the design will be discussed as implemented on an off-the-shelf FPGA-SoC
device.
The FPGA-SoCs targeted in this thesis are of the Zynq family from Xilinx. The CPU and
FPGA can function independently, and coexist on the same chip sharing a capable and efficient
interconnect allowing our designs to be realised at a good performance level. In Fig. 2.9 we
have a simplified view of the parts of the SoC and interconnect that are relevant to this thesis.
Connections to peripherals and details not relevant to our discussion are omitted for clarity. One can refer to Chapter 5 of the Xilinx Zynq-7000 TRM [2] for more details on the interconnect of the SoC and the rest of the system-on-chip.
Figure 2.9: Zynq 7-series FPGA-SoC and interconnect: the dual-core ARM Cortex-A9 with its L1 I/D caches and 512 kB L2 cache, the memory interconnect and DRAM controller, the general-purpose slave ports, and the buffered high-performance AXI ports (HP[3:0]) to the reconfigurable logic. Only connections relevant to the architectures researched in this thesis are included for clarity of presentation. Source: Zynq-7000 TRM [2]
The FPGA has dedicated high bandwidth buffered ports leading to an interconnect that allows
simple adjustments such as combining multiple sequential requests for increased efficiency. This then links to a high-efficiency DRAM controller with content-addressable memories (CAMs),
that can further reorder the requests at the DRAM level to maximise the achievable efficiency.
In addition to these connections, there are dedicated slave-master, and master-slave general-
purpose ports to the CPU that can be used for control and more fine-grained communication and
other ports to peripheral devices. For our architectures the focus is on the general purpose AXI
controllers, the four high-performance ports (exposed to the FPGA as AXI HP[3:0]) functioning
at a maximum of 150 MHz, as well as the ARM Cortex-A9 CPU and its cache subsystem,
communicating at more than double the frequency of the FPGA side but through a single port.
High-level Synthesis
Finally, the architectures presented in the next chapters were designed and implemented using
Xilinx’s Vivado High-level synthesis (HLS) tools. These allow the hardware to be expressed
as a combination of C/C++ routines and pre-processor pragmas which guide the translation
from C/C++ to VHDL/Verilog. Using these pragmas one can guide pipelining, interfaces,
hardware unit reuse, blocking and non-blocking evaluation, resource allocation and the static
scheduling of operations to hardware units, much as at the HDL level. The designer then gets a quick estimate of the performance and can view the resources instantiated and the scheduling of operations on hardware units. The hardware design can also be debugged first as a C-simulation, to correct logic errors faster with the help of software debug tools, before testing on the slower but more accurate RTL-level simulation.
The use of these tools greatly accelerates the development and debugging of custom hardware,
while allowing faster and easier design-space exploration. The key in this process is that these
tools are still hardware development tools. They cannot translate any software to hardware
but rather allow the expression of certain operations in terms of a higher-level language, a
subset of which is synthesizable. If correctly used, by a designer with a deep understanding
of the process and the underlying hardware, the high-level synthesis tools do not adversely affect an architecture’s throughput per cycle and largely achieve comparable latency results for
most pipelined architectures. Instead, they bring some predictable penalties to the maximum
achievable frequency and certain resource usage overheads. At the same time they allow a re-
searcher to investigate more ambitious and complex designs in a smaller time-frame, potentially
leading to riskier but more promising architectures being explored. These overheads are hard
to quantify as there is at the time of writing limited research work comparing large complicated
designs realised in HLS and traditional HDLs with similar optimisation effort for both.
However, results from the work presented in this thesis put the achievable frequency in the region of 100-125 MHz, which is very close to other image processing designs in the literature, such as Greisen et al. [50] in 2011. Moreover, some recent works such as [51] seem to indicate that while hand-written HDL can achieve very low latency for simple tasks, the ease of development of HLS tools can actually result in higher performance for more complicated applications, by allowing more time to be allocated to exploring and optimising the resulting hardware instead of simply implementing and debugging. In Chapters 4 and 5 we will discuss certain design
choices and overheads in the evaluation sections that were identified in our work and were
associated with the choice of HLS as our development language. Nevertheless, as we will discuss
in the following chapters, very high performance levels can still be achieved, which means
the implementations produced by HLS can be directly used without modification for many
applications where development time is of higher importance than the maximum achievable
performance and resources for a certain FPGA.
2.8 Related Work
Since platforms in the embedded space have significant constraints in power and performance,
current embedded SLAM implementations focus on sparse SLAM that is adapted towards re-
ducing computational requirements further such as [52,53] and feature-based implementations
of sparse visual or visual-inertial odometry [54–56] that provide limited or no large-scale mapping and reconstruction capabilities. A state-of-the-art work in this category is SVO [57]. It combines a semi-direct tracking technique that is both robust and efficient compared with other sparse methods, allowing it to run on a lightweight embedded platform. It achieves high accuracy and reduced drift in under 20 ms on an embedded platform; however, it does so by mapping a small number of points and concentrating on visual odometry. These approaches
map a sparse selection of features that reduces the density of the reconstruction to a small
selection of 3D points and often cannot provide a large-scale coherent map during exploration.
One approach that has been explored is combining efficient visual or visual-inertial odometry
with the option of offloading computation to a base station and reconstructing a dense or global
map there, as for example in Sturm et al. [13] and others [14, 58]. This strategy can address
some applications but comes with increased power consumption for the wireless communication,
as well as increased latency. It also comes with a reduced area of operation, and very high
bandwidth requirements if it needs to operate in real-time, while the odometry cannot benefit
from the denser map to enhance its interaction with the environment. A complete SLAM
solution with higher-density onboard an embedded platform stands to overcome all of these
disadvantages and is necessary for the emerging applications discussed in the introduction.
Dense SLAM has been advancing rapidly in the last decade. There have been examples with
expanding capabilities, such as deformable objects and surfaces for full environment recon-
struction [37] and explicit object modelling. However, its requirements in sensors, energy and
computation are infeasible for an embedded platform. Works in semi-dense methods such as
LSD-SLAM, are more applicable to the embedded space thanks to lower computational com-
plexity and reliance on simpler RGB or greyscale passive camera sensors. LSD-SLAM [17] for
example, provides a tracking accuracy comparable to other state of the art sparse methods but
generates a higher density map that provides significantly more information about the envi-
ronment. As such, it was selected as the target for the custom accelerator presented in this
work.
Executed on a general-purpose CPU, LSD-SLAM requires a quad-core high-end x86 CPU to run in real time. In the embedded space, attempts to run SLAM of this kind have had to reduce functionality to simple visual odometry and run at reduced resolution. Since the runtime scales almost linearly with the number of points to process, by reducing the number of pixels by a factor of 4 and optimising the algorithm’s features to run on a mobile device, Schoeps et al. achieved a performance of approximately 20 frames/sec for mapping at a resolution of 320x240 and tracking at 160x120 [59], trading off accuracy and richness for lower complexity. However,
code optimisation is not enough to allow a mobile device to handle the original resolution,
density and other features without a large drop in performance below real-time specifications.
Recently, there have been attempts in designing custom hardware for SLAM in the embedded
space. Suleiman et al. [60] demonstrated a custom ASIC design for visual-inertial odometry
targeting nano-drones. It belongs in the category of sparse odometry and achieves high perfor-
mance together with power efficiency, realised as a chip fabricated on a 65nm CMOS process.
It enables environment awareness for very lightweight robots, but because of its specialisation it
only performs the version of sparse visual-inertial odometry it was designed for and cannot be
extended to semi-dense or dense SLAM. This is a typical example of an optimised ASIC imple-
mentation of an algorithm, which trades flexibility and cost to achieve the highest performance
and power efficiency for a specific task.
Moving from ASICs to reconfigurable hardware, work on FPGAs in the past has been limited
in scope to accelerating selected computation kernels for sparse SLAM as pre- or co-processors
such as [61,62]. For example, FPGAs have been used to implement feature detectors based on
SURF [63], SIFT [64] and many others, as well as stereo disparity estimation such as [50, 65].
A very interesting example of both, where the FPGA is acting as a coprocessor is [66] where
FPGAs are considered for Autonomous Planetary Rover Navigation. In this line of thought,
Honegger et al. [67] proposed a custom board combining an FPGA and a mobile CPU for
robotic vision, evaluated by offloading a popular type of disparity estimation algorithm (SGM
stereo) to the FPGA. One characteristic of this architecture is a one-way link through the FPGA, between the camera and off-chip memory. This can cover some use cases, for example the
pre-computation of feature extraction, but is limited in the amount of processing it can do.
In contrast, we target a more flexible system architecture that can allow more fine-grained
cooperation between hardware and software.
2.8. Related Work 53
These works also cannot directly lead to significantly low power, high-density SLAM since they
deal with only pre-processing steps, leaving a large percentage of the computation to the CPU
that has to receive this data. This means their resulting performance-per-watt will remain close
to that of the general purpose CPU if a large percentage of the computation time is spent there.
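The bound described here is Amdahl's law: if only a fraction of the frame time is offloaded to the FPGA, the end-to-end speedup, and hence the performance-per-watt gain, is limited by the CPU-resident remainder. A minimal sketch with hypothetical numbers (the 20% fraction and 50x kernel speedup below are illustrative, not measurements from the cited works):

```python
def overall_speedup(offloaded_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: speedup of the whole pipeline when only a fraction
    of the per-frame work is offloaded to an accelerator."""
    return 1.0 / ((1.0 - offloaded_fraction) + offloaded_fraction / kernel_speedup)

# If a pre-processing step is 20% of frame time and the FPGA makes it 50x
# faster, the whole pipeline only gets ~1.24x faster: the CPU still dominates.
print(round(overall_speedup(0.20, 50.0), 2))  # 1.24
```

Even an infinitely fast accelerator would cap the speedup at 1/(1 - 0.20) = 1.25x in this scenario, which is why this thesis targets more complete functionality on the accelerator side.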
The work presented in this thesis targets more complete functionality from the accelerators’ side
in order to achieve very high throughput levels with low latency between frames, at a much
lower power consumption, targeting smaller and lighter platforms. Moreover, our approach
avoids the pitfall of requiring hardware-based integration and knowledge, instead exposing the
accelerator as a C++ function with an integrated driver underneath, in a full Linux-based
operating system. This makes it more approachable to the computer vision community by
reducing the difficulty and expertise required in integrating it with existing software-based
frameworks.
Before our work, to the best of our knowledge, there was no hardware implementation of a
complete SLAM algorithm, and looking at visual-inertial odometry the only complete hardware
implementation on a chip is Navion [60], mentioned above. In contrast, our work targets a more
complete implementation of the semi-dense tracking and mapping tasks. Some sub-tasks are
still allocated to the mobile CPU where that choice is more efficient, however the accelerators
presented in this thesis are an almost complete implementation of state-of-the-art semi-dense
tracking and semi-dense mapping and operate with a much more fine-grained and efficient
cooperation with the CPU.
Evaluation of different works in the field of SLAM
One major challenge in comparing these works with ours is the variation in evaluation
methods and in the information reported in publications. Some software-based works use
newer standardised benchmarks of the SLAM space, such as the TUM datasets [68], the ICL-
NUIM dataset from Imperial College London and the National University of Ireland Maynooth
[46] and the EuRoC MAV dataset [69]. The newest open versions of all of these datasets can be
found on the respective websites of the academic institutions mentioned in the papers referenced here.
54 Chapter 2. Background
These datasets were a major step forward in comparing different SLAM methods, offering a hard
ground truth and multiple measures of the trajectory error of an algorithm over the lifetime of a
dataset. However, different papers might use only a small subset of these, or initialise their work
with different conditions, in an approach that can distort the final results. Moreover, looking at
the performance of different methods, papers published in computer vision and robotics venues
often neglect to provide a detailed or rigorous evaluation of the computational performance of
their methods. So while some works will clearly state average and even performance over time,
and describe their platforms in terms of exact CPU model and amount of memory such as
ORB-SLAM [41] and Whelan et al. [36], which can be used to extract an accurate estimate of
performance requirements, others are more vague. Also, information on optimisation levels and
the optimisation effort from the designer are often missing and can vary significantly between
implementations.
Evaluation of the work presented in this thesis
As such, the metrics with which we compare this work to related works and evaluate it are as
follows. First, since most low-power solutions focus on sparse SLAM or visual odometry, one
level of comparison is in terms of features and tracking/map density. The work presented in
this thesis aims to achieve the same power consumption and tracking latency as these works
while offering the full range of features of a state-of-the-art SLAM and a richer semi-dense
map reconstruction. As we have not modified the algorithm this work was based on, but
chose to remain faithful to the original implementation, the quality and accuracy of SLAM
utilizing the presented architectures is the same as the base algorithm, LSD-SLAM [17]. In the
different published works and presented architectures care was taken to test and establish this
equivalency to the software solution in terms of functionality and results.
Secondly, an important aspect of this work is its power requirements and performance-per-watt.
Using information provided in different papers we can attempt to create a map of the typical
power consumption of different platforms used. For instance, by knowing that a laptop version
of an Intel i7-4700MQ CPU was used for all experiments, a typical expected power consumption
should be in the range of 36-47W with a high multi-threaded load² such as the one LSD-SLAM
would produce. This method can introduce errors of a few percentage points but should give
a consistent estimate of the order of magnitude of power requirements for SLAM algorithms
published in the last few years. For papers that focus more on power consumption and report
power measurements the comparison is with the published power figures directly. In the next
two chapters, for the presented architectures power was estimated from the synthesis and place-
and-route tools, and in Chapter 5 power was measured directly at the wall for different test
platforms and our implemented accelerators while executing LSD-SLAM to supplement these
estimates with real-world measurements.
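As an illustration of how such comparisons are formed, performance-per-watt is simply throughput divided by power draw. The numbers below are illustrative midpoints of the ranges discussed in this section, not measurements:

```python
def perf_per_watt(fps: float, watts: float) -> float:
    """Throughput per unit power: the main efficiency metric used here."""
    return fps / watts

# Illustrative midpoints of the ranges discussed in the text (not measurements):
desktop = perf_per_watt(30, 40)    # LSD-SLAM on a desktop/laptop-class CPU
fpga    = perf_per_watt(80, 6.5)   # the proposed FPGA-SoC accelerators
print(round(fpga / desktop, 1))    # order-of-magnitude improvement
```

With these representative figures the ratio comes out above 16x, consistent with the order-of-magnitude improvement claimed later in this section.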
Finally, the achieved performance of the presented work is evaluated in every chapter. This
is measured in terms of the latency to track a frame or update the Keyframe’s map, or the
inverse of that latency as frames per second for the tracking and mapping tasks’ throughput.
Since performance varies significantly with the features, map density and used resolution for
different algorithms, this part of the evaluation is a direct comparison with LSD-SLAM as pure
software, and the proposed hardware-accelerated version for each chapter. For Chapters 3 and
4 the ARM Cortex-A9 dual core CPU in Xilinx Zynq SoCs is used to provide a baseline for the
software version. In Chapter 5 finally, a high-end desktop CPU (Intel i7-4770) and a high-end
quad-core mobile CPU (ARM Cortex-A57) are used for performance comparisons. The first of
the two establishes the performance achievable from an optimised implementation on a modern
CPU if power is not a constraint, while the second showcases the performance achievable from
an optimised implementation on a modern high-end mobile CPU used in a high-performance
mobile system where performance-per-watt is important. Comparing to the literature, Engel et
al. [17] report performance in the region of 30 frames/sec. for the same resolution as our tests,
using a high-end quad core laptop CPU that as we discuss later in this section has a typical
power consumption of 35-45 W.
The above experiments were conducted with a selection of datasets that represent two different
real-world environments: the first one is in a small but cluttered room indoors, while the second
one is in an outdoor area with loop-closures (revisiting the same space multiple times), with
a combination of close objects, medium-distance buildings and far-away features such as trees
and clouds. These were provided by the Technical University of Munich (TUM), by the authors of
LSD-SLAM, on TUM's website³. These datasets were used for their quality and the variety
of environments they contain, to provide measurements for the performance of the proposed
hardware compared to a software-only version, and to ensure equivalency with the software
implementation.

²For this figure we utilise the manufacturer's thermal design power and specifications [70], with an estimated load of 60-80% across at least three out of four cores.
As such, the comparison to related work is made in two dimensions. The first one is to compare
the overall solution our work provides, in terms of map density, features, and with the accuracy
of a state-of-the-art method, with other embedded and desktop solutions. In this dimension
we execute a more complex, higher-density SLAM algorithm compared to the state of the art
in embedded SLAM, but at the power requirements of mobile, sparse visual odometry. The
second dimension is the performance-per-watt improvement of running LSD-SLAM compared
to current general purpose platforms. On this dimension, we achieve a rate of processing 5 times
higher than that of a high-end mobile CPU (Quad-core ARM [email protected]) with a
comparable and lower power consumption. Meanwhile, the proposed accelerator matches the
performance of a high-end desktop x86 CPU (Intel i7-4770 @ 3.77GHz) but with an order of
magnitude less power consumption, for an overall improvement of an order of magnitude in
performance per watt compared to general-purpose CPUs for advanced semi-dense SLAM.
An overview of the field, giving a comparison across these two dimensions, and combining
hardware and software implementations across different SLAM categories is given in Table 2.1.
As will be discussed in Chapter 5, where a more concise version of such a table is used, it is not
meant to be exhaustive or rank the works. It is instead compiled to showcase the breadth of the
field, and the features and completeness that different algorithms offer towards environment
aware applications. Our work fits between the two following main approaches, positioned to
bridge the large gap in SLAM capabilities between research in algorithms and applications in
embedded systems.
³https://vision.in.tum.de/research/vslam/lsdslam
On one side there are dense examples with advanced features, such as tracking over deformable
surfaces [37] and even using a persistent 3d graph map of arbitrary reconstructed objects [38].
These approaches however are utilizing the latest and highest performance CPUs and GPUs
for acceleration with power requirements in the hundreds of watts, and usually necessitate
specialised RGB-D cameras and depth sensors that are heavy and power hungry in their own
right. On the other side, works targeting lightweight micro aerial vehicles (MAVs), medium-
sized robotics and augmented/virtual reality have so far focused purely on sparse features used
for accurate tracking, with very limited large-scale capabilities and a sparse map as a result.
With the architectures presented in this thesis, by significantly raising the performance-per-watt
capabilities of a platform for SLAM utilizing custom hardware, we can provide semi-dense, large-
scale mapping capabilities with only a passive monocular camera, and a power consumption in
the single digits when realised on an off-the-shelf FPGA-SoC. As such, Table 2.1 showcases
works in SLAM and visual odometry targeting pure software implementations, embedded
platforms and hardware acceleration, and their positioning in the two dimensions
of performance-per-watt and capabilities/density, with the final line being our work as presented
in Chapter 5 combined with the work from Chapter 4.
Work                    | Type          | Hardware Plat.    | Density    | Large-scale | Monocular | Typical Power  | Performance
Algorithmic:
ORB-SLAM [41]           | SLAM          | Laptop CPU        | Sparse     | X           | X         | approx. 38-47W | 30 fps track., 2.2 fps map.
LSD-SLAM [17]           | SLAM          | Laptop CPU        | Semi-dense | X           | X         | app. 40-50W    | 30 fps
Whelan et al. [36]      | SLAM          | GPU Accelerated   | Dense      | X           |           | app. 230-360W  | 30+ fps
Fusion++ [38]           | SLAM          | GPU Acceleration  | Dense      |             |           | app. 250-400W  | 4-8 fps
Embedded:
Leutenegger et al. [45] | SLAM          | Laptop CPU        | Sparse     | X           | X         | app. 30-50W    | 20 fps (scal.)
SVO [57]                | Odometry      | Laptop            | Sparse     |             | X         | app. 30-40W    | 166 fps (acc.)
SVO [57]                | Odometry      | Jetson-TX1        | Sparse     |             | X         | app. 10-15W    | 55 fps (fast)
Barry et al. [71]       | Obstacle Det. | 2x ODROID-U2      | Sparse     |             |           | app. 20W       | 120 fps
Hardware:
Weberruss et al. [61]   | Feature Extr. | FPGA              | N/A        | N/A         | N/A       | 5.3W           | 2 ms feat. extr.
Navion [60]             | Odometry      | ASIC (65nm CMOS)  | Sparse     |             | X         | 24mW           | 171 fps
Oleynikova et al. [72]  | Obstacle Det. | Mobile CPU + FPGA | No map     |             |           | app. 14-20W    | 60 fps
This work               | SLAM          | FPGA SoC          | Semi-dense | X           | X         | 6-7W           | 60-100 fps

Table 2.1: State-of-the-art SLAM examples. Compiled with a focus on the features and characteristics of different solutions, to demonstrate the breadth of the field in terms of features and typical or, where available, reported power requirements. Comparison is with camera resolutions in the same region of MPixels.
Chapter 3
Accelerating semi-dense SLAM
Simultaneous localisation and mapping (SLAM) is central to many emerg-
ing applications such as autonomous robotics and augmented reality. These
require an accurate and information rich reconstruction of the environment
which is not provided by the current state-of-the-art sparse, feature-based
methods in the embedded space. SLAM needs to be performed in real time,
with a latency low enough to ensure small translation and rotation between
successive camera frames. At the same time, dense SLAM that can provide a
high level of reconstruction quality and completeness comes with high compu-
tational and power requirements and the available platforms in the embedded
space often come with significant power and weight constraints. Towards
overcoming this challenge, the first part of this chapter discusses the charac-
teristics and computational patterns of this type of application, focusing on
a state-of-the-art semi-dense direct SLAM algorithm, a novel category pro-
viding a much denser map than traditional feature-based methods but in a
more efficient formulation than other state of the art dense SLAM methods.
The second part discusses our work on designing a custom hardware accel-
erator, based on an FPGA-SoC, to provide semi-dense SLAM functionality
at a much lower power level.
Finally, an accelerator for semi-dense, direct tracking is evaluated. The
achieved acceleration is discussed, together with the bottlenecks and chal-
lenges that were encountered.
This chapter is based on a conference paper co-authored by Christos-Savvas
Bouganis [Boikos Konstantinos et al., FPL, IEEE, 2016 [18]]
3.1 Motivation
Semi-dense and dense SLAM is crucial in many emerging applications of robotics and embedded
vision. However, state-of-the-art SLAM comes with high computational requirements, requiring
high-performance CPUs for modern sparse and semi-dense methods, and GPU acceleration for
dense real-time SLAM. At the same time UAVs and other robots impose tight constraints on
the power and weight of the electronics on board. Some of them also impose strict latency
constraints, while other applications that require some form of SLAM, such as augmented
reality, have even higher power and weight constraints due to being worn on the head and
being passively cooled [73].
These requirements, which are not met by software optimisation or advances in general-purpose
silicon, can be addressed by designing new hardware platforms, specialised for this task. In the last
decade, major FPGA companies introduced a new type of embedded System-on-Chip (SoC),
one that combines an embedded low-power CPU integrated with programmable logic on the
same chip. Utilizing the programmable logic, custom specialised hardware accelerators can be
designed to act as coprocessors alongside the mobile CPU, to combine the best of both worlds.
This type of heterogeneous system, while having low enough power requirements to fit low-
power platforms, stands to substantially improve performance compared to general purpose
embedded CPUs/GPUs and bring more advanced and computationally demanding algorithms
to the embedded space. As we have discussed in the first two chapters, custom hardware in
the form of an FPGA or ASIC can provide a high level of absolute performance but, more
importantly, stands to significantly improve performance per watt, meaning light and more
power constrained platforms can be targeted.
SLAM is mainly composed of two strongly interdependent tasks, tracking and mapping. They
share similar low-latency requirements and are both computationally intensive. As will be
demonstrated by the profiling results in this chapter, together they account for more than 80% of the
computation time of real-time SLAM. They are also the two tasks that need to happen in
real-time during exploration, as discussed in the background chapter, to keep track of a moving
robot or platform, while the global map maintenance can be allocated to background tasks or
even done offline. As such, they both need to be accelerated to achieve a high-performance
low-power solution in the embedded space.
In this thesis, both tasks are addressed by complementary accelerators. However, the task
of tracking was targeted first, for the following reasons. Fast and accurate tracking is crucial
in SLAM and must be established before accurate mapping can be introduced. Firstly, it is
the entry point of all new information. If tracking skips a frame, the information it contained
cannot be used at any other point in the algorithm. Tracking has to be fast enough, with a
performance of at least 30 frames per second depending on the speed of the moving robots, in
order to keep up with the camera movement, as is demonstrated for example in [39].
Direct tracking is also based on an assumption that only a small amount of translation has
occurred between frames [1], making a high rate of processing even more crucial for fast-moving
cameras. Faster moving platforms, such as fixed-wing aircraft, necessitate even faster sustained
tracking rates than that. In the work of Andrew Barry [6] for example, stereo-based localisation
and mapping for obstacle avoidance on a fixed-wing drone processes at a rate of 120 frames per
second. Therefore, in Chapters 3 and 4 we focus on first designing an accelerator to provide
high-performance low-power tracking. In Chapter 5, the task of mapping is also addressed,
with a specialised high-throughput accelerator for semi-dense depth map estimation.
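The small-motion assumption behind these frame-rate requirements can be made concrete with a back-of-the-envelope calculation; the platform speeds below are hypothetical examples, not figures from the cited works:

```python
def translation_per_frame(speed_m_s: float, fps: float) -> float:
    """Camera translation between consecutive frames, assuming constant speed."""
    return speed_m_s / fps

# A platform moving at 10 m/s sees ~33 cm of motion per frame at 30 fps,
# but only ~8 cm at 120 fps, keeping the small-baseline assumption valid.
print(round(translation_per_frame(10, 30), 3))   # 0.333
print(round(translation_per_frame(10, 120), 3))  # 0.083
```

This is why faster platforms, such as the fixed-wing drone mentioned above, require sustained tracking rates well beyond 30 frames per second.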
This chapter investigates the performance and acceleration opportunities for a complex, opti-
mised, state-of-the-art semi-dense SLAM algorithm, based on LSD-SLAM by Engel et al. [17],
with custom hardware accelerators based on an FPGA SoC. This algorithm was chosen since,
as has been discussed in the Background chapter, it offers a denser map, more suitable for
applications in robotics and its direct tracking formulation is better generalisable to a variety
of indoor and outdoor environments. Furthermore, the fact that it only requires an off-the-
shelf monocular camera sensor means that the weight and power requirements are significantly
reduced before the processing is introduced, as discussed in Chapter 2.
We begin the chapter by characterising the performance and computation patterns of the
open-source version of LSD-SLAM, a hand-optimised, multi-threaded, vectorised software im-
plementation. We then present a custom, fully-pipelined architecture targeting the three key
functions present in the tracking task. Finally, there is an analysis of the performance and
performance per watt of the designed accelerators and the bottlenecks at the accelerator and
system level.
The key contributions presented in this chapter are the following:
• First, the characterisation and analysis of a state-of-the-art semi-dense SLAM algorithm.
• Second, an architecture based on an FPGA system-on-chip that formed the first acceler-
ator design, to the best of our knowledge, to target state-of-the-art semi-dense SLAM.
• Third, the identification of opportunities and bottlenecks of heterogeneous FPGA SoCs
for such algorithms.
3.2 Tracking Algorithm in LSD-SLAM
LSD-SLAM is a semi-dense, large-scale direct SLAM algorithm with a monocular camera input.
Its tracking method is direct whole-image alignment with a photometric residual to recover
an estimate for the 6-degree-of-freedom pose of the camera. It subsequently uses that pose
estimate to update the current depth map estimate. The mapping task uses the recently tracked
frame with its recovered pose estimate to perform frame-to-frame epipolar stereo. For each
successful observation across the two views a depth estimate is recovered and a filtering depth
update is performed for already observed points to improve the current estimate. Mapping also
keeps track of the uncertainty of the depth estimate for each mapped pixel, treating the depth
observation as a Gaussian probability distribution. This depth and uncertainty information is
stored in a data structure called a Keyframe, which contains, bundled with the pixel values of
a frame, depth and depth variance information as well. Finally, the most recent copy of that
depth information is the input of the tracking task to track the next incoming frame, closing
the information loop of simultaneous localisation and mapping.
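The Keyframe structure described above can be sketched as follows. The field names are illustrative and do not correspond to LSD-SLAM's actual C++ types; it is only meant to show what information travels between the mapping and tracking tasks:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Keyframe:
    """Sketch of the Keyframe described in the text: a frame's pixel values
    bundled with per-pixel depth estimates and their variances.
    (Illustrative names, not LSD-SLAM's implementation.)"""
    intensity: np.ndarray        # H x W grayscale image
    depth: np.ndarray            # H x W depth estimate (valid where mapped)
    depth_variance: np.ndarray   # H x W uncertainty of each depth estimate
    pose: np.ndarray = field(default_factory=lambda: np.eye(4))  # world pose

    def valid_mask(self) -> np.ndarray:
        # Semi-dense: only pixels with a finite variance carry an estimate
        return np.isfinite(self.depth_variance)
```

Tracking consumes the most recent copy of this structure as its reference, while mapping filters new stereo observations into it, which is the information loop mentioned above.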
The principles behind this process were presented in Chapter 2. For the rest of the thesis we
will focus on the elements of the algorithm and its software implementation which are relevant
to its performance and acceleration. This chapter will focus more on elements of the tracking
task important to the accelerator architecture presented in Section 3.4. Information regarding
the algorithm and implementation of the mapping task will be presented in a more complete
form in Chapter 5, with only some high-level information repeated here as necessary to discuss
the architecture, profiling results and the performance of the system.
3.2.1 Tracking in LSD-SLAM
In Section 2.4 we introduced an algorithmic overview of LSD-SLAM. This section will discuss,
from a high level perspective, the computations involved in the software version of LSD-SLAM
and introduce a pseudocode representation for the main calculations and data involved.
The core of the tracking algorithm is the iterative optimisation process presented in Chapter 2.
The pose is recovered by aligning the frame that is tracked with the current Keyframe and
attempting to minimise the variance-normalised squared error of the intensity differences. Every
valid 3D point of the map is part of the input. Utilizing the depth, variance and intensity values
of the map points, these points are reprojected to the plane of the current camera frame to calculate
the photometric error and the gradient of the reprojection.
The tracking task is composed of three main sub-functions, a residual and gradient calculation
function, a weight and normalising factor calculation function and a function that generates
the linear system for optimisation. The first one will be referred to as residual calculation, the
second one as weight calculation and the third one as linear system generation throughout the
thesis. The first two essentially perform computations on the same data points, and are
responsible for the reprojection and error calculation mentioned above. They were separated in
the available open-source implementation of LSD-SLAM but are always called sequentially, so
in designing an accelerator they were treated as one task and combined in hardware into one
set of operations. The last one is called only for a subset of the iterative optimisation process
described in Chapter 2.
The control flow implementing this iterative Levenberg-Marquardt [48] opti-
misation is as follows. The error and gradients are calculated first, followed by the weighting
factors for the weighted least squares formulation. Optimising for a solution to the pose esti-
mate makes use of a second-order approximation of the error function. From that, a damped
linear system (before the solving of the system, a damping factor is added in the form of a
normalising constant subtracted from the main diagonal) is constructed in the final software
function. If its solution fails to decrease the error (which is verified from a re-run of the first
two functions which calculate the total weighted error), the damping parameter is changed and
the error calculation is executed again, without the task of linear system generation. If the
error is decreased, an update to the pose estimate is finally calculated and a new optimisation
step is generated from the new position. This process is summarised in Fig. 3.1.
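The accept/reject control flow described above can be sketched on a toy least-squares problem. This is a generic Levenberg-Marquardt skeleton, not LSD-SLAM's implementation; the damping schedule and the test problem are illustrative only:

```python
import numpy as np

def lm_step(J, r, lam):
    """One damped step: solve (J^T J + lam*I) delta = -J^T r."""
    A = J.T @ J + lam * np.eye(J.shape[1])
    return np.linalg.solve(A, -J.T @ r)

def levenberg_marquardt(residual_fn, jacobian_fn, x, iters=20, lam=1.0):
    err = np.sum(residual_fn(x) ** 2)
    for _ in range(iters):
        delta = lm_step(jacobian_fn(x), residual_fn(x), lam)
        new_x = x + delta
        new_err = np.sum(residual_fn(new_x) ** 2)
        if new_err < err:        # accept: move, relax damping (Gauss-Newton-like)
            x, err = new_x, new_err
            lam *= 0.5
        else:                    # reject: increase damping (gradient-descent-like)
            lam *= 10.0
    return x

# Toy 1D fit: find x minimising (x - 3)^2
sol = levenberg_marquardt(lambda x: x - 3.0, lambda x: np.eye(1), np.zeros(1))
print(round(float(sol[0]), 3))  # 3.0
```

The damping factor lam plays the same role as the λ described above: small values give fast Gauss-Newton steps, large values give conservative gradient-descent-like steps.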
The functionality surrounding the main computations only takes a few percentage points of
the computation time on a mobile CPU but would still require significant resources, especially
targeting a smaller FPGA, to be mapped to hardware. Furthermore, the computation in the
residual and weight calculation functions happens more often than the linear system function,
and the latter will be called after the other two finish and the final error is known. This led to
the decision for the work presented in this Chapter, presented in more detail in Section 3.4, of
designing a separate pipeline for each of these two main operations, and separate them from
the rest of the control. Thus the residual and weight calculation together are calculated in
one pipeline and the linear system generation happens in another. The rest of the operations
were left in software due to their low relative computational cost, identified with timing and
profiling results presented in the next section, and to avoid complex control being unnecessarily
implemented in an FPGA fabric.

Figure 3.1: Control flow of the tracking task. Pixels are projected using the depth map and pose estimate; intensity and gradient are interpolated for the reprojection error; the weights (W), residual and Jacobian vectors are calculated; a linear system is generated from a Taylor expansion approximation, $H\delta + b = 0$, giving $\delta = -H^{-1}b$, or $\delta = -(J^T W J + \lambda I)^{-1} J^T W r(\xi_n)$; the system is solved for a new pose estimate, convergence is checked and λ is adapted; if the error did not decrease, a new λ is applied and the step is repeated. The three main sub-functions (residual and weight calculation, and linear system generation) are described inside the dashed lines. The rest of the computation and control described in the figure happens outside these functions.
The process is repeated for different resolutions from small to large, following the pyramid
processing discussed in Chapter 2. Every step of the optimisation process across iterations and
pyramid levels is dependent on the output of the previous one. This led to the decision to
parametrise the accelerators in a resolution agnostic way. However, it still meant that the
computation for different pyramid levels could not be overlapped, but had to be performed
sequentially. The control of this process presents a large number of possible paths but is not
computationally demanding so the mobile CPU was chosen as a better fit to handle the higher-
level decisions.
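A coarse-to-fine pyramid of the kind described here can be sketched as repeated 2x2 averaging; LSD-SLAM's exact downsampling filter may differ, so this is only a structural illustration:

```python
import numpy as np

def build_pyramid(image: np.ndarray, levels: int):
    """Halve the resolution at each level by 2x2 averaging
    (a common scheme; shown here only as an illustration)."""
    pyramid = [image]
    for _ in range(levels - 1):
        img = pyramid[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2  # crop odd sizes
        img = img[:h, :w]
        down = 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                       + img[0::2, 1::2] + img[1::2, 1::2])
        pyramid.append(down)
    return pyramid

# Tracking proceeds coarse-to-fine: the smallest level is optimised first.
pyr = build_pyramid(np.zeros((480, 640)), 4)
print([p.shape for p in reversed(pyr)])  # [(60, 80), (120, 160), (240, 320), (480, 640)]
```

Because every level's optimisation is seeded by the result of the previous one, the levels must be processed sequentially, which is exactly why the accelerators were parametrised in a resolution-agnostic way rather than overlapped.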
3.2.2 The tracking algorithm
In the interest of brevity and clarity of presentation not all variables and calculations are
shown. Rather, the focus is on the details relevant to the research and discussion presented in
this thesis. All threshold variables are also potentially tunable parameters that can trade-off
quality and performance. As the focus of this research is on hardware design, the tuning of
these parameters to different robotic systems and environments is outside the scope of this
thesis, and orthogonal to our work. The variables and data structures used below are the same
that were defined in Section 2.4. The figure included in that chapter is copied here for reference,
as Fig. 3.2.
Figure 3.2: SLAM algorithmic overview. The main tasks are inside the two dashed rectangles. The main data structures are indicated with orange coloured boxes, while the light blue boxes represent the main tasks necessary to perform tracking and mapping. With grey we indicate some of the background tasks involved in full SLAM.
This section focuses on the complexity of the Track Frame function, and its dependency on the
number of valid Keypoints as well as the current pyramid level and the system's maximum resolution
for tracking. For the theoretical background and functionality of the tracking task one can refer
to Section 2.5. This section will only focus on the computations involved in implementing that
functionality.
The first sub-function, the calculate residuals function, calculates the values in equation 3.1,
and also calculates other intermediate values, such as the intensity derivatives dx, dy, for the
eventual construction of the sum of weighted, squared residuals and the linear system to solve,
both of which happen in separate functions, and buffers them in memory in a result buffers
data structure.
$r_p(\mathbf{p}, \xi_{ji}) := I_i(\mathbf{p}) - I_j(\omega(\mathbf{p}, D_i(\mathbf{p}), \xi_{ji}))$   (3.1)
The calculate residuals function is presented with pseudocode in algorithm 2. The main part of
it is the calculations responsible for the reprojection of Keypoints to the new Frame's projective
plane, leading to the residuals used in the sum of weighted residuals in equation 3.2, which is used
in the optimisation process. It is always followed, after a check for divergence (an early error
estimate that is significantly high), by the weight and residual calculation function. That function, presented
here as algorithm 3, uses the intermediate results of the residual calculation function to calculate
the overall weighting factor for each residual, integrating it in the result buffers data structure,
and calculates the final normalised error in equation 3.2.
$E_p(\xi_{ji}) = \sum_{\mathbf{p} \in \Omega_{D_i}} \left\lVert \dfrac{w(\mathbf{p}, \xi_{ji})\, r_p^2(\mathbf{p}, \xi_{ji})}{\sigma_{r_p(\mathbf{p}, \xi_{ji})}^2} \right\rVert_\delta$   (3.2)
The total error calculation is repeated in the Calculate Linear System function once an incre-
ment is accepted. This last function, in algorithm 4, calculates the linear system to solve to
generate the update equation 3.3, without including the normalising factor λI, which is added
outside. J in equation 3.3 is the stacked matrix of all Jᵢ pixel-wise Jacobians and W is the
diagonal matrix of weights with $W_{i,i} = w(\mathrm{residual}_i)$.
$\delta\xi^{(n)} = -(J^T W J + \lambda I)^{-1} J^T W\, r(\xi^{(n)})$   (3.3)
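Equation 3.3 maps directly onto a few lines of linear algebra. The sketch below is a generic implementation of the damped, weighted Gauss-Newton step, not LSD-SLAM's code, and it assumes the diagonal weight matrix W is stored as a per-residual vector:

```python
import numpy as np

def pose_increment(J, W_diag, r, lam):
    """Damped weighted Gauss-Newton step of equation (3.3):
    delta = -(J^T W J + lam*I)^-1 J^T W r, with W stored as a diagonal."""
    JTW = J.T * W_diag            # broadcasting: equivalent to J.T @ diag(W)
    H = JTW @ J + lam * np.eye(J.shape[1])
    return -np.linalg.solve(H, JTW @ r)

# Shapes for semi-dense tracking: one residual per valid Keypoint, 6-DoF pose.
n_points = 1000
J = np.random.randn(n_points, 6)   # stacked pixel-wise Jacobians
W = np.ones(n_points)              # per-residual weights
r = np.random.randn(n_points)      # photometric residuals
delta = pose_increment(J, W, r, lam=1.0)
print(delta.shape)  # (6,)
```

Note that the 6x6 system H is tiny; the cost is dominated by forming JᵀWJ and JᵀWr over all valid Keypoints, which is exactly the reduction the linear system generation pipeline accelerates.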
As we can see in algorithm 1, the linear system calculation is not repeated for every optimisa-
tion step. Rather, it happens for the first step, and then the increment is temporarily applied
to the pose. From that new pose estimate, the residuals and the error are calculated again from
algorithms 2 and 3. If that pose decreases the error it is accepted, and a new linear system
is generated to calculate the next increment from that position in the optimisation space. If
it increases the error it is rejected and a new normalization factor λ is applied to the pre-
viously generated linear system, essentially transitioning between Gauss-Newton optimisation
and gradient descent.
Algorithm 1 Track Frame Function

procedure TrackFrame(Keyframe, last pose, new Frame, K)    # K is the intrinsic camera matrix
    pose estimate = last pose
    reference = Subsample and copy valid Keypoints(Keyframe)    # Gen. tracking reference
    for pyramid level = max level to min level do
        # Intermediate results, such as residuals and gradients, are buffered in this function
        (residual, result buffers) = Calculate residuals(new Frame, reference, pose estimate, K)
        Check for divergence(residual)
        (error, weights) = Calculate weights and residual(new Frame, reference, pose estimate, result buffers)
        λ = 1
        while iteration < max iterations AND converged = False do
            (A, b) = Calculate Linear System(result buffers)    # First-order Taylor approx.
            while error not decreased do
                A = A + λI
                increment = solve(A, b)
                new pose = pose estimate * increment    # Exponential mapping
                Calculate residuals(new Frame, reference, new pose, K)
                Check for divergence(residual)
                Calculate weights and residual(new Frame, reference, new pose, result buffers)
                if λ ≤ 0.2 then
                    λ = 0
                else
                    λ = λ * 1/2    # Tunable
                if error < last error then
                    pose estimate = new pose
                    error not decreased = False
                    if error / last error > conv threshold then
                        converged = True
                else
                    if increment < increment threshold then
                        converged = True
3.2. Tracking Algorithm in LSD-SLAM 69
Algorithm 2 Calculate Residuals Function
procedure Calculate Residuals(new Frame, reference, pose estimate, K)
    for all Keypoints in reference do    # Keypoints with a valid observation
        Reproject Keypoint to new Frame(reference[Keypoint])    # Use current pose estimate
        Interpolate(I, dx, dy)    # Subpixel interpolation for intensity and gradients
        Calculate Intensity residuals()
        Calculate sum of squared values to buffer()
        Calculate fitness score()
        residuals[Keypoint] = residuals[Keypoint] ∗ fitness
        Result buffer[Keypoint].add(residuals, squares, I, dx, dy)
        if fitness score < fitness threshold then
            Result buffer[Keypoint].skip = True
    # First warning of divergence comes from sum of squared residuals
    residual = Sum(squared residuals) / number of good points
    return residual, number of good points, Result buffer
Algorithm 3 Calculate Weights and Residual Function
procedure Calculate Weights and Residual(new Frame, reference, Result buffer)
    points = 0
    for all results in Result buffer do
        points = points + 1
        error = Result buffer[Keypoint].residual
        variance = reference[Keypoint].depth variance
        Calculate weight(variance, error)
        Add pixel noise estimate to weight(weight)
        weighted error = Huber(error ∗ weight)    # Huber Norm
        summed error = summed error + weighted error ∗ weighted error
        Result buffer[Keypoint].weight = weight
    return Result buffer, summed error/points
Algorithm 4 Calculate Linear System Function
procedure Calculate Linear System(Result buffer)
    for all results in Result buffer do
        Ji = Calculate Elements of Jacobian vector(result)
        residual = result.residual
        weight = result.weight
        A = A + Ji ∗ Jiᵀ ∗ weight
        b = b − Ji ∗ residual ∗ weight
        error = error + residual ∗ residual ∗ weight
    return A, b, error
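The accumulation in Algorithm 4 can be sketched in C++ for a 6-DoF pose. The names and structure below are illustrative, not the thesis source; the real implementation operates on the buffered results of the preceding functions.

```cpp
#include <cassert>
#include <cstddef>

// Builds A = sum(w_i * J_i * J_i^T) and b = -sum(w_i * r_i * J_i), matching
// the normal equations of 3.3 before the damping term lambda*I is added.
struct LinearSystem6 {
    double A[6][6] = {};   // 6x6 normal matrix, zero-initialised
    double b[6] = {};      // right-hand side
    double error = 0.0;    // weighted squared error

    void addPoint(const double J[6], double residual, double weight) {
        for (int i = 0; i < 6; ++i) {
            b[i] -= J[i] * residual * weight;
            for (int j = 0; j < 6; ++j)
                A[i][j] += J[i] * J[j] * weight;
        }
        error += residual * residual * weight;
    }
};
```

Each tracked point contributes a rank-one update to A, so the loop over the result buffer is a long sequence of independent multiply-accumulates, which is the property the Jacobian Update Unit later exploits.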
Mathematical description of the maximum gradient calculation
Finally, we will briefly describe an important aspect of the mapping function, which is the
pixel selection to generate Keypoints. This will be discussed again in terms of computational
requirements in Chapter 5, but is included here to illustrate its effects on the reference data
structure that tracking uses extensively as a local copy of the map stored in the Keyframe. The
pixel selection is based on the maximum gradient in the neighbourhood of a pixel with
coordinates (x, y), where I(x, y) is the light intensity at that location:
dx = I(x+1, y) − I(x−1, y)
dy = I(x, y+1) − I(x, y−1)
gradient(x, y) = √(dx² + dy²)
max vert(x, y) = max( gradient(x, y), gradient(x, y+1), gradient(x, y−1) )
max gradient(x, y) = max( max vert(x, y), max vert(x−1, y), max vert(x+1, y) )
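An illustrative C++ translation of these equations follows (container types are assumed; border handling is omitted, so the queried pixel must lie at least two pixels inside the image):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Image = std::vector<std::vector<float>>;  // indexed as I[y][x]

// Central-difference gradient magnitude at (x, y).
float gradientMag(const Image& I, int x, int y) {
    float dx = I[y][x + 1] - I[y][x - 1];
    float dy = I[y + 1][x] - I[y - 1][x];
    return std::sqrt(dx * dx + dy * dy);
}

// Vertical 3-element max, the first pass of the separable 3x3 maximum.
float maxVert(const Image& I, int x, int y) {
    return std::max({gradientMag(I, x, y),
                     gradientMag(I, x, y + 1),
                     gradientMag(I, x, y - 1)});
}

// Horizontal pass completes the 3x3 neighbourhood maximum.
float maxGradient(const Image& I, int x, int y) {
    return std::max({maxVert(I, x, y),
                     maxVert(I, x - 1, y),
                     maxVert(I, x + 1, y)});
}
```

The separable vertical-then-horizontal max mirrors the two equations above and lets intermediate max-vert values be reused across neighbouring pixels.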
Algorithm 5 Select pixels to map and use for tracking
for all Keypoints k(x, y) in Keyframe do
    if max gradient k(x, y) > gradient threshold then
        result = attempt mapping(k)
        if result == successful then
            k.valid = 1
3.2.3 Mapping and Global Optimisation
The pose estimation process of tracking, with the exception of an initialization phase that
happens only at the beginning, is done with respect to an existing Keyframe, a data structure
encapsulating depth and variance estimates and the frame with which they are made, as was
defined in Chapter 2. This Keyframe is used as input, but at the same time has to be con-
tinuously updated. This results in a constant synchronisation for the software implementation
which comes with a cost in memory traffic and latency for both tasks.
In addition to this, when a sufficient distance/disparity exists between the Keyframe and the
current camera frame, the Keyframe is replaced. At that point, the depth values need to
be propagated to the new frame. This is similar to the reprojection step that happens in
tracking, with the additional step of adding a depth estimate at the location of projection in
the new Keyframe and slightly increasing the variance estimate. An extra normalisation step is
added to this process to ensure the average inverse depth value is 1.0, avoiding problems related
to scale ambiguity, an inherent characteristic of monocular SLAM algorithms. During that
process, the tracking function needs to maintain the last valid copy of the map to function,
or alternatively wait out this costly process. In general, the accelerator for tracking has to be
able to co-operate and continuously synchronise its inputs and outputs in millisecond latencies
with multi-threaded functions operating on various buffers that are not guaranteed to be in the
same memory location, and occupy non-contiguous virtual memory pages. This can be one of
the key sources of delays in an accelerator’s real-world performance as will be discussed in the
evaluation section.
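The normalisation step mentioned above can be sketched as follows. This is a simplified illustration: in the full system the Keyframe's pose and the variance estimates must also be rescaled consistently with the same factor, which is omitted here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rescales the inverse depth estimates so that their mean is exactly 1.0,
// returning the scale factor that was applied. Relative depths (and hence
// the map geometry up to scale) are preserved.
double normaliseMeanInverseDepth(std::vector<double>& inverseDepths) {
    double sum = 0.0;
    for (double d : inverseDepths) sum += d;
    double scale = inverseDepths.size() / sum;   // factor that makes the mean 1.0
    for (double& d : inverseDepths) d *= scale;
    return scale;
}
```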
3.3 Profiling and Performance Analysis
The source code used for the analysis in this chapter is the open-source version provided by the
authors of LSD-SLAM for research purposes. It consists of highly optimized C++ code, with
some tasks in hand-written assembly utilizing vector instructions targeting modern multicore
CPUs. The task of mapping iterates over all pixels in the Keyframe serially and the calculations
for each pixel are independent from others. Thus, it can be parallelised on the loop iteration level
and in the software implementation it is mapped to multiple threads processing different sets of
image rows in parallel. On the other hand, most of the functions involved in tracking need the
result of one loop iteration to proceed to the next, hence they need to execute serially. These
took advantage of hand-written vectorised routines, to attempt to maximise the attainable
performance. In total there were three versions for the computationally demanding functions
in tracking; one in standard C++, one utilizing SSE x86 instructions and one utilizing NEON
instructions targeting ARM CPUs.
As an initial step in the profiling of the implementation, the provided source code was compiled,
run and profiled on a desktop system. Eventually all versions were compiled with the maximum
optimisation enabled on the compiler as well as platform specific optimisations, such as using
the hard FPU unit on the ARM core tested in the next sections. The algorithm was timed and
profiled on a desktop PC using the vectorised implementation targeting an Intel i7-4770 CPU
and both the NEON and standard C++ implementations were tested and timed on a dual-core
ARM Cortex-A9. This was done to present highly-optimised software results in both high-end
desktop and mobile platforms and have a fair comparison point to the accelerators.
3.3.1 Profiling Tools
The main tool used to obtain profiling results was Callgrind from the Valgrind suite [74].
This tool runs the executable in a controlled virtual environment where it counts the number
of CPU cycles spent on different instructions and matches them to lines of code using debug
symbols. It then aggregates these costs per function call, and can offer annotated code views
with assembly matched to C++ code. It also works well with multi-threaded applications, which
was crucial for this implementation. For this type of application it gives a comprehensive view
of where performance is needed and where the most computationally intensive parts are.
In the profiling results, tracking in LSD-SLAM was shown to account for the majority of the
total computation time together with mapping. Timing the different functions involved, it was
verified that most of the execution time was spread in the different components of the two
main functions encapsulating the tracking and depth map update tasks. Data collected from
the Valgrind / Callgrind tool is summarised in Table 3.1. Because of the multi-threaded nature
of the workload, and because the test cases run here included reading and decoding an image
from disk due to using pre-loaded datasets, the numbers do not perfectly reflect the importance
of each task to real-time SLAM but need to be interpreted in that light.
Firstly, the image input tasks would not appear in online SLAM and would be replaced with
loading an image from memory, copied there from a camera which would have a much smaller
cost. Furthermore, that cost can be overlapped with the tracking computation after the first
image streams in. As soon as the first image loads, tracking can begin operating on that while
the second frame loads in a different buffer. Thus, as long as both tasks are faster than the
target frame rate, and since they are completely independent, this will not affect the throughput
of the system. There are also background workloads included in Table 3.1. Some are part of
SLAM such as the pose-graph optimisation and some are not, such as the visualisation task.
These are not immediately dependent on the online result of tracking and mapping, and unless
we have a tracking failure, tracking does not need to communicate with them. As such, the
performance of these background tasks does not have to affect the performance of the live
odometry happening in tracking and mapping.
This has to be interpreted by taking into consideration the hardware utilized. Although in
domain-specific cases such as control and automotive, and in ultra-low-power applications
targeting milliwatt-grade power consumption, one might still encounter single-core
microcontroller-grade CPUs, most low-power application-grade CPUs today have a minimum of
two cores, and at the time of writing this thesis it is hard to find a mobile CPU, even in a
smartphone, that does not come with 4 physical cores as standard. Since this is the type of CPU
that will be targeted for even a sparse SLAM algorithm to achieve real-time performance, as a
microcontroller could not cope, a split of 4 tasks (two for SLAM, one for handling
communication with the outside world and one for background optimisation) means the tracking
and mapping tasks will have a physical CPU to themselves for the majority of the time.
As a result, when discussing the acceleration and frame rate of SLAM one has to focus on two
metrics. The first is the latency of executing each task in software or in a dedicated hardware
accelerator. The second, increasingly important in a data-intensive application such as this, is
how well the two live and interdependent tasks, tracking and mapping, can overlap, share data,
and sustain their respective throughputs. This is because in software, and as we will discuss in
the next chapters in hardware as well, they take a comparable amount of time to execute.
Running them serially instead of overlapped would therefore mean a significant reduction in
performance, since processing one frame through both would have the latency of the two tasks
combined instead of the latency of the slower task.
If we focus on the computation time that is necessary for real-time SLAM and ignore the costs
of a graphical user interface (GUI), visualisations of data and image file decoding, we end up
with a split that is heavily focused on the two main tasks, tracking and mapping, each composed
of a few interdependent functions. The data can then be normalised to discuss percentages in
relation to the computation time for tasks necessary to perform SLAM. The resulting normalised
percentages are approximately 44% for the mapping task and 41% for the tracking task, with
the rest attributed to background tasks. This is visualised in Fig. 3.3.

Task or task group                       Percentage of computation
Frame Tracking                           31%
Depth Estimation                         33%
Image Decode and Input                   10%
Frame Creation                           2%
Graph Optimisation and other tasks       9%
Visualisation and Graphics               10%

Table 3.1: Profiling Results - Callgrind / x86 Intel CPU
Figure 3.3: Percentage of total computation for the tasks comprising SLAM (Mapping 44.4%, Tracking 41.3%, Background Tasks 14.3%)
3.3.2 Timing Results
Guided by the profiling results, the execution of key functions was timed on both the desktop
and the embedded CPU. This was done to complement the profiling results by providing a
runtime performance figure and to provide a point of comparison between an embedded platform
and a high-end desktop machine for this type of application. In Tables 3.2 and 3.3 we can see a
summary of average execution time for the x86 and the ARM processor respectively. Execution
times of different sub-functions were measured and were combined under the name of two
respective tasks. The main takeaway from timing all the internal sub-functions is that most
of the computation load was split between different loops operating on arrays of elements and
that they all needed to be addressed to accelerate the tracking and mapping tasks effectively.
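As an illustration of how such per-function timings are typically collected in C++ (a generic helper, not the instrumentation used in the thesis):

```cpp
#include <cassert>
#include <chrono>

// Runs a callable once and returns its wall-clock execution time in
// milliseconds, using the monotonic steady_clock so the measurement is
// unaffected by system clock adjustments.
template <typename F>
double timeMilliseconds(F&& task) {
    auto start = std::chrono::steady_clock::now();
    task();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In practice each sub-function of tracking and mapping would be wrapped this way and the results averaged over a dataset run, as in Tables 3.2 and 3.3.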
One of the key characteristics of the application is the large amount of data movement for these
functions. Something that came up in the timing tests is that the ARM CPU’s performance was
more significantly impacted by the large amount of memory transactions than the Intel desktop
CPU. One strong indicator was the relative drop in performance for the randomly accessed
frame in the centre of the residual calculation function. This is attributed to differences in
cache size and memory subsystem design complexity that mean the desktop CPU can hide
more of the memory system’s latency. Lastly, the ratio of time spent per function varied due
to the much wider multiply pipeline of the desktop CPU that is better able to cope with some
functions employing a large amount of parallel independent multiplications, such as the linear
system update function.
These results establish the importance of accelerating the tracking and mapping tasks in order
to enhance the performance of this algorithm in the embedded board. They also show that
SLAM, as is the case with many computer vision algorithms, is characterised by large and
non-trivial data movement and memory and caching strategies should be a high priority in any
platform designed for such algorithms.
Task                                  Mean execution time
Tracking task (hand-coded SSE)        12.3 ms
Map update task (4 threads)           14.2 ms

Table 3.2: Timing Results - Intel i7-4770 @ 3.77 GHz

Task                                  Mean execution time
Tracking task                         440 ms
Tracking task (NEON acceleration)     378 ms
Map update task (1 thread)            1000 ms
Map update task (2 threads)           576 ms

Table 3.3: Timing Results - ARM Cortex-A9 @ 667 MHz

At the time that the work presented in this chapter was submitted for publication in a peer-
reviewed conference, the comparison was made to the ARM Cortex-A9 that was available on
the FPGA board for testing. This demonstrated the gap between mobile CPUs, with power and
weight characteristics to fit in emerging embedded applications, and the performance required
to process semi-dense SLAM in real-time. The Cortex-A9 on the FPGA was running a full
Linux-based operating system, based on Ubuntu 12.10. In Chapter 5 we will also discuss using
a more recent mobile CPU with higher absolute performance, number of cores, and power
requirements, the ARM Cortex-A57, using a newer Ubuntu distribution (v. 16.04). However
the same conclusion still stands. In this type of application the required performance and
latency cannot be satisfied simply by advances in general purpose hardware, especially in the
light of Moore’s law slowing down. In this context, this work shows that for this application
a custom hardware solution can provide a significant improvement in performance per watt
and an absolute performance level that can tackle this challenge. In the rest of this chapter, a
deeply-pipelined architecture will be discussed as a candidate towards achieving this goal.
3.4 Accelerator architecture
3.4.1 System architecture and control
In this section, an FPGA-SoC is the base device for an architecture developed to accelerate
key functions of LSD-SLAM. As presented in Chapter 2, this type of device offers a power
efficient mobile CPU tightly integrated with programmable logic fabric on the same chip. The
design was developed around the fixed memory system parameters that the off-the-shelf devices
provide. For the family of devices discussed here, there were four available “High-performance
Direct Memory Access ports” (HP - DMA) connected to the DDR controller allowing for direct
memory requests and high-speed burst transfers to be initiated directly from master controllers
implemented on the FPGA fabric. These can operate at a maximum of 150 MHz, with a bus
width of 64 bits, and contain small buffers that allow some combining of operations. The
custom hardware is also connected to a general purpose (GP) port, again through an AXI
interconnect, directly to the ARM CPU. Communicating through the GP port, the accelerator
on the FPGA acts as a slave to enable high-level control and parameter set-up from the software
running on the CPU.
The system architecture is shown in Fig. 3.4. Two accelerators were implemented in the work
presented in this chapter to offload the functions involved in the majority of the computation
time for the tracking task. They are presented in the figure as the “Residual and Weight
Calculation Unit” and the “Jacobian Update Unit”. The first is responsible for calculating
the map point re-projections, their photometric residual and the interpolated image gradient
on the projected location and the individual weighting factors for the error function. These
outputs are then processed in the Jacobian Update Unit, which generates the linear system
representing the weighted least squares optimisation.
Both the CPU and the custom hardware have access to the same memory space which helps
avoid redundant copying and simplifies communication. However, during execution of the
algorithm, for each camera frame the software transfers the data to be processed to a dedicated
area of the off-chip memory, reserved for accelerator-CPU communication. The reason for this
Figure 3.4: System Architecture
copy is that the input data of these functions comes from interfacing functions to the semi-
dense map, stored in two software buffers. These are scattered in multiple virtual pages in
the Linux OS-managed memory making memory accesses more costly in hardware, since an
address translation is needed for every page (4096 Bytes) of data accessed, and unless that page
is locked in memory, that process has to happen continuously. To alleviate this, they are copied
in this dedicated area where contiguous memory is allocated. The memory is IO-mapped at the
OS kernel / CPU level making it non-cached and avoiding a costly step of flushing the entire
L1 and L2 caches after writes from the accelerator or the CPU to synchronise the two. The
alternative has been well studied, and would be adding a translation circuit and optionally a
translation buffer on the FPGA synchronised with the OS. This would however complicate the
design significantly and generate its own overheads in access time, without offering significant
research value which led to the first option being preferred for this work.
Following the copy step, the embedded CPU sets different control signals on the reconfigurable
logic such as image dimensions, a previous pose estimate and of course the hardware pointers for
the dedicated data stores in the off-chip DRAM and calls the accelerators as a direct replacement
to the software functions. The tracking software thread executing on the CPU then waits for
a response from the custom hardware while other software threads perform depth mapping
Figure 3.5: Accelerated tracking and mapping execution in software. The timeline shows control and synchronisation, frame cache updates, and the calls to the Residual and Weight Calculation and Jacobian Update accelerators across pyramid levels 4 to 1, overlapped with mapping.
and global optimisation in the background. The accelerator may be called multiple times per
pyramid level without any further copy step required, since the allocated memory for the frame
contains valid data and is simply reused. This reduces the impact of the frame copy.
In Fig. 3.5 we can see a simple example of how this process might work. The top line represents
an example timeline of tracking two frames, one after the other, over a pre-computed map, while
in blue we see how mapping would execute simultaneously with some cost to copy information
over and synchronise different software threads. First come the copy and control steps
mentioned above. Then the accelerator updates the frame cache if it is the first time operating
on this pyramid level. In the scope of the accelerator, the term cache is used to describe an on-
chip memory storing a copy of the current camera frame to improve random access performance.
The size of this cache is always equal to the entire frame due to the nature of the accesses.
Finally, we see a different number of calls towards the two hardware accelerators, until a level is
changed. The width and number of different executions is not to scale but meant to showcase
the execution flow of calls and synchronisation. The cost of execution and synchronisation
overheads will be further evaluated in Section 3.5.
The second accelerator has as its inputs the values produced by the first accelerator, stored
in a part of the dedicated DRAM region. The decision to operate like this was because the
respective function for the second accelerator is not called for every error calculation. Instead,
the linear system is calculated once at the beginning of the optimisation process and then is
recalculated for every new estimate that actually reduces the reprojection error. Therefore,
matching that behaviour in hardware meant executing the second accelerator after the first one
finishes. Since the data generated up to that point are too large to store in an on-chip cache,
and are accessed sequentially, so that a cache would not have a big impact on performance, they
are offloaded to the off-chip memory. Hence, after the first accelerator returns control to the
CPU, the software compares the new error and then optionally calls the second accelerator for
a linear system re-calculation using that output as its input.
A note on number representation
The custom hardware was designed to maintain the accuracy of the software functions it re-
placed, and be a drop-in replacement in the software implementation, interfacing with the rest
of the algorithm executing on the ARM CPU. Hence, most calculations are on single-precision
floating-point, with some intermediate results stored as double-precision. Designing custom
hardware opens the way to using custom-precision numerical formats. In Chapter 4 we discuss
how, for example, fixed-point was used to store and process pixel information and their gradients.
In tracking, pixel values need a representation with fractional digits to preserve information,
since subsampled images are also processed.
In general, there are many points in the system where the high-dynamic range of a floating
point representation is necessary, especially for operations in the accelerator of Chapter 5. A
half-precision floating-point format was also considered. However, in the accelerators presented
in this and the following chapter for tracking, a half-precision floating point was not explored
as it was not supported at the time of the system development by the tools. In the accelerator
presented in Chapter 5, a newer version of the tools was used to test a design utilizing half-
precision floats (16 bits) for units that could potentially afford the reduction in accuracy.
However, the first iterations gave a resource cost not significantly lower than the single precision
float due to the cost of conversions to-and-from higher precision formats and resulted in larger
latency for some tasks.
Overall, for the work presented in this thesis the focus is in the exploration of novel architectures
for the system that potentially had a more significant impact and arguably a higher research
value at this point towards bringing the field of advanced state-of-the-art SLAM algorithms to
low-power embedded platforms. The use of custom precision, a well-researched topic in custom
computing, was left as a possible future exploration to further optimise resource utilization
once the architecture was set.
3.4.2 Residual and Weight calculation Unit
This unit implements the functionality of two software functions. It is responsible for the
projection of map points to the camera frame and the calculation of the error, photometric
gradients and weights that result from this comparison. The values of these gradients and
the intensity need to be calculated for a point with the image plane coordinates (u, v) that
result from the re-projection, as demonstrated in Fig. 3.8. To estimate a sub-pixel version of
these quantities a linear interpolation method is employed that utilizes a window of 4 pixels
around that point and weighs their contribution based on the distance to the 4 surrounding
pixels. The implementation of this will be discussed further in Section 3.4.2. Its functionality
is demonstrated in Fig. 3.6. Based on the floating point coordinates (u, v) a weighting factor
is calculated for the four pixels surrounding a point, estimating an interpolated value for its
intensity. The same weights are used to interpolate the gradients dx, and dy as will be discussed
further in Section 3.4.2. Based on these values, the error and the weights necessary for the
optimisation process are calculated and streamed out. These values will be needed to calculate
the total residual of the estimated pose. The overview of the architecture of this hardware unit
is presented in Fig. 3.7.
The first task is to re-project the reference point from its 3D coordinates (recovered from the
Keyframe’s pose and the estimated inverse depth) to the current camera plane based on the
current pose estimate. This involves a matrix to vector multiplication and a vector addition.
Figure 3.6: Intensity value interpolation for a projected point using its 4 neighbouring pixels. For (u, v) = (x + 0.3, y + 0.7), the interpolated intensity I(u, v) weighs the four neighbours by 49%, 21%, 21% and 9%.
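A sketch of this bilinear interpolation in C++ (illustrative types; img[y][x] indexing and in-bounds coordinates are assumed, and the same weights would be reused for the gradients dx and dy):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Image = std::vector<std::vector<float>>;  // indexed as img[y][x]

// Bilinear interpolation at the sub-pixel location (u, v): the four
// surrounding pixels are weighted by the fractional offsets (du, dv).
// For (du, dv) = (0.3, 0.7) the weights are 21%, 9%, 49% and 21%.
float interpolate(const Image& img, float u, float v) {
    int x = static_cast<int>(std::floor(u));
    int y = static_cast<int>(std::floor(v));
    float du = u - x, dv = v - y;
    return (1 - du) * (1 - dv) * img[y][x]
         + du * (1 - dv) * img[y][x + 1]
         + (1 - du) * dv * img[y + 1][x]
         + du * dv * img[y + 1][x + 1];
}
```

Because the four weights sum to one, a constant image interpolates to the same constant, which is a quick sanity check on any hardware implementation of this block.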
Figure 3.7: Residual and Weight Calculation Unit
The projection results in the new coordinates, x’, y’ and the new depth (z’) from the frame
of reference of the camera pose. They are then divided by the depth z’ and multiplied by
the intrinsic camera matrix, according to the pinhole camera model, with an optional final
correction according to the lens characteristics to give the corresponding coordinates on the
image plane. The result of this is a pair of coordinates corresponding to the actual captured
image frame, (u, v), as shown in Fig. 3.8.
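A sketch of this projection chain in C++ (illustrative structure and names; the optional lens-distortion correction is omitted):

```cpp
#include <cassert>
#include <cmath>

// Pinhole intrinsics: focal lengths and principal point.
struct Pinhole {
    double fx, fy, cx, cy;
};

// Transforms a 3D point p by a rotation R (row-major 3x3) and translation t,
// then applies the pinhole model. out[0..1] = (u, v) image coordinates,
// out[2] = z' (depth in the new camera frame).
void project(const double R[9], const double t[3], const double p[3],
             const Pinhole& K, double out[3]) {
    double x = R[0] * p[0] + R[1] * p[1] + R[2] * p[2] + t[0];
    double y = R[3] * p[0] + R[4] * p[1] + R[5] * p[2] + t[1];
    double z = R[6] * p[0] + R[7] * p[1] + R[8] * p[2] + t[2];
    out[0] = K.fx * x / z + K.cx;   // divide by depth, apply intrinsics
    out[1] = K.fy * y / z + K.cy;
    out[2] = z;
}
```

The matrix-vector multiply, vector addition, division by z' and intrinsic scaling here correspond one-to-one with the pipelined multipliers, adders and divider described for the hardware unit.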
These are passed to the interpolated element block which, as described before, returns the
interpolated result for the gradient dx, dy and the intensity. After the interpolated element
block returns the three results the residual calculation unit then proceeds to complete the
computation. The first step is calculating the actual residual, which is the difference between the
interpolated intensity at the current frame, and the intensity of the keyframe at the projected
coordinates. Then we proceed to accumulate five sums: the sum and the sum of squares of the
interpolated intensity value (Σc₁ and Σc₁²), the sum and the sum of squares of the keyframe
pixel intensity (Σc₂ and Σc₂²), and the sum of a local weight factor:

weight = 1 if |residual| < 5, and 5/|residual| otherwise.

Figure 3.8: Tracking involves projecting a map point, with a recovered inverse depth 1/z, to the image plane in the current camera frame.
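The local weight factor described above can be sketched as follows (the threshold of 5 is taken from the text; the function name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Huber-style weight: residuals below the threshold keep weight 1, while
// larger residuals are down-weighted by threshold/|residual|, limiting the
// influence of outliers on the least-squares solution.
double localWeight(double residual, double threshold = 5.0) {
    double a = std::fabs(residual);
    return (a < threshold) ? 1.0 : threshold / a;
}
```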
Finally the weight calculation takes place. This was originally a separate function in the code
but was included in this unit since it results in a more efficient design, reusing all the temporary
results, before they are written to the off-chip DRAM. At this stage the weights and final error
are calculated, normalising with the estimated depth variance of the map point.
From a high-level perspective, the module is organised in three pipelined hardware blocks.
One block handles pre-fetching batches of inputs, one block writing back batches of outputs
and the main calculation block performs all of the necessary computation for the residual and
weight calculation. Before the unit begins work, if this is the first step of this pyramid level,
the entire frame being tracked is pre-fetched in a local frame buffer. This is done because
pixel gradients, required as part of the algorithm, are accessed in a random pattern, and would
require extensive unordered memory read requests. In the software implementation these values
would be calculated before tracking began and stored in the DRAM as a floating-point buffer.
Figure 3.9: Residual Calculation Pipeline - Pixel Re-projection
Instead, by pre-fetching pixel intensity information and calculating these quantities on the fly,
there is a significant reduction in memory bandwidth and total latency for the computation at
the cost of extra hardware adders.
The first section of the pipeline of the residual and weight calculation is depicted in Fig. 3.9,
and corresponds to the projection of a map point to the image frame coordinates (u, v). For this
set of computations, there are a total of 9 multiplications and 6 additions. These are pipelined
into three hardware float multiply units and two adders. Targeting a processing interval of 6
cycles to match the latency of the random accesses from the frame cache of the system allowed
a more relaxed allocation of resources in this section. The frame cache, also discussed as part
of the system architecture, contains during computation the current camera frame to guarantee
low latency access for the random memory requests during the residual and weights calculation.
The result from these multiplications and additions is the point coordinates from the new frame
reference. These are then fed into one divider, then one multiplier and finally an adder, all
of which are also time shared, to produce the final pair of coordinates (u, v), that forms part
of the input of the Interpolated Element Unit, described in the next subsection. The main
pipeline was not aggressively optimised for throughput since the main bottleneck at this point
was at the communication with the off-chip memory.¹
After the Interpolated Element Unit calculates the intensity and gradient values for the pro-
jected point, the rest of the pipeline calculates the different sums described in the beginning of
the section, which are necessary for the weighted least squares optimisation. At this stage there are four
multiplier units utilized to perform these calculations. One is dedicated to the squared sums, while
the other three, together with one of the multipliers depicted in Fig. 3.9, are time-shared with
the multiplications in the weight calculation block. Two floating-point division units are also
present, likewise shared with the weight calculation block.
In total, the weight calculation section of the pipeline performs 18 floating-point multiplications, four
divisions, two additions, two subtractions and one accumulation. Most of its units are time-shared
with the residual calculation, as mentioned. There are also a floating-point square-root unit and
an absolute-value unit. The result of this function is a weight factor corresponding to the
current residual.
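The exact weight function is not reproduced here, but direct methods in the LSD-SLAM family typically use a robust estimator such as the Huber norm to down-weight outlier residuals. The sketch below is illustrative only; the form and threshold are assumptions, not taken from this chapter:

```python
def huber_weight(residual: float, delta: float = 5.0) -> float:
    """Robust weight for iteratively re-weighted least squares.

    Returns 1.0 for small residuals and delta/|r| beyond the
    threshold, so large (outlier) residuals contribute less.
    The threshold `delta` is an illustrative value, not the one
    used in the thesis.
    """
    r = abs(residual)
    return 1.0 if r <= delta else delta / r
```

The square-root and absolute-value units mentioned above are consistent with evaluating a function of this general shape on the residual magnitude.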
After this computation finishes, seven results are buffered to be written back to the off-chip
memory. The first three are the re-projected x’, y’, and z’ coordinates, intended for use by the
second accelerator. The residual for this tracked point and the interpolated gradients
dx and dy are also buffered and written back. The unit buffers all of these results in a set of 6
Block RAMs. Another Block RAM buffers the resulting weight factor from the weight
calculation, which is the seventh value to be written back.
Some of the allocation and sharing of hardware units is a result of the automatic scheduling
in the High-level Synthesis tools that were used to design the presented accelerator. For some
cases, such as here, the synthesis flow can find design points that are more efficient than a
straightforward hand-built pipeline, in a fraction of the development time required to sched-
ule these operations manually using traditional HDL languages. However, this is achieved by
attempting to reuse units across the pipeline for different operations, and can result in con-
nections that are less organised than the output of a human designer and hence sometimes
unintuitive and harder to depict in a clear way. For this reason, from this point of the thesis onwards we
will focus more on resource usage and the corresponding throughput, and less on the topology of
the hardware, except where that topology is significant to the presented architecture.
The first design explored as part of this work issued memory transactions directly from the
main pipeline calculating the residual and weight values, whenever new values were ready to be read.
However, this was found to suffer significantly from the memory system's latency, resulting in
low performance. It was then converted to the system presented in Fig. 3.7, using buffers
to convert memory transactions to bursts. The I/O blocks prefetch batches of 50 map points
(5-vectors) into a 250-word buffer and then wait for the residual and weight calculation to finish
before writing out seven bursts of 50-word outputs, each corresponding to an internal buffer for
that output array, implemented in BRAMs.
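The shape of these transfers can be sketched as plain data records. This is an illustrative sketch only: the thesis specifies 5-word inputs and 7 output words per point, but the input field names below (intensity, inverse-depth variance) are assumptions introduced for illustration.

```python
from dataclasses import dataclass, fields

WORD_BYTES = 4
BATCH = 50

@dataclass
class MapPointIn:
    # one 5-word input record; field names are illustrative assumptions
    x: float
    y: float
    z: float
    intensity: float       # assumed: reference intensity
    inv_depth_var: float   # assumed: inverse-depth variance

@dataclass
class TrackedPointOut:
    # one 7-word output record written back per tracked point
    xp: float              # re-projected x'
    yp: float              # re-projected y'
    zp: float              # re-projected z'
    residual: float
    dx: float              # interpolated horizontal gradient
    dy: float              # interpolated vertical gradient
    weight: float

# one input burst is 50 points x 5 words = 250 words (1000 bytes), as in the text
assert BATCH * len(fields(MapPointIn)) == 250
```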
Let us first discuss the choice of an image cache and the on-the-fly recalculation of gradients.
When a configuration with a hardware pipeline calculating the gradients on the fly was tested
on the board, as opposed to one issuing transactions to the off-chip memory, the pipeline's
throughput increased by more than a factor of two thanks to the frame cache and the
extra computation. Firstly, the size of the data loaded is smaller, since only pixel intensity
information is loaded, avoiding the transfer of an additional two floating-point gradient values
per pixel. Secondly, by performing one large burst read from memory, the overall latency cost
is orders of magnitude smaller than that of multiple random-access requests. This design
also improves power efficiency, despite the extra multiplications, since off-chip memory requests,
which are more costly in terms of energy, are reduced.
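The magnitude of the latency effect can be illustrated with a simple model: N independent random reads each pay the full memory round-trip, while one burst pays it once and then streams one word per cycle. The numbers below are illustrative, not measurements from the board:

```python
def total_cycles_random(n_words: int, latency: int) -> int:
    # every random request pays the full round-trip latency
    return n_words * latency

def total_cycles_burst(n_words: int, latency: int) -> int:
    # one request pays the latency, then one word arrives per cycle
    return latency + n_words

# illustrative parameters: 1000 words, 60-cycle memory round-trip
n, lat = 1000, 60
speedup = total_cycles_random(n, lat) / total_cycles_burst(n, lat)
print(f"burst is ~{speedup:.0f}x faster")  # ~57x for these numbers
```

For large transfers the burst cost approaches one cycle per word, which is why converting random accesses into bursts dominates every other optimisation discussed in this section.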
This strategy of using an on-chip frame cache for the current camera frame was not replicated
for the mapping information, which forms the rest of the input to the tracking task.
The map is considerably larger than the camera frame, but the main reason
is that the mapping information is accessed sequentially, making caching less important, since
burst transactions to memory can be used to improve the efficiency of the memory system. A
first, straightforward configuration was designed that would generate these sequential requests
to memory while accessing the map information. It turned out that this system suffered a
large performance bottleneck from the memory system's latency, which was not revealed in
simulation but during the physical board tests. Because the hardware generated by the tools would
not overlap multiple memory requests, and the latency of replies from the high-performance
ports was greater than anticipated, the accelerator was operating at a much reduced rate. A
later modification led to the system that was published and is presented here, where the unit's
inputs and outputs are read and written as buffered bursts of 50 vectors, rather than element by
element on every iteration. This takes advantage of the increased efficiency of burst reads and writes
and reduces the overall cost of the memory access latency.
Finally, this, combined with the on-chip frame cache, improved the throughput of the hardware
units by three to four times. However, the combination of these
strategies proved insufficient in practice to provide the necessary performance for real-time
operation, as a significant memory bottleneck remained and the pipeline was underutilized.
The main reason is that the communication and computation still could not be overlapped
without a complete redesign of the pipeline, which took place in the following work presented
in Chapter 4 utilizing all the lessons learned here. The accelerator presented in this Chapter
executes a burst read, saves it at an intermediate buffer and then processes the first 50 data
points (or 250 words). Then, it streams out the results and prefetches another set of 50. The
number of points is not always divisible by this batch size, so the remainder is processed by the same
pipeline in a sequential configuration with a lower throughput. The performance effect of this
is small, since the problem size is larger than 20,000 points; the only significant cost was a
small increase in resource usage for the extra control logic.
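The batching control flow just described can be sketched in Python. This is a software model only; `process_batch` and `process_single` stand in for the pipelined and sequential hardware paths:

```python
BATCH = 50  # points per burst; 5 words in and 7 words out per point

def track_points(points, process_batch, process_single):
    """Process points in full 50-point bursts; handle the remainder
    point-by-point on the slower sequential path, as the accelerator
    in this chapter does."""
    results = []
    full = len(points) - len(points) % BATCH
    for i in range(0, full, BATCH):
        # pipelined path: one burst in, compute, seven bursts out
        results.extend(process_batch(points[i:i + BATCH]))
    for p in points[full:]:
        # sequential path for the remainder, at lower throughput
        results.append(process_single(p))
    return results
```

With problem sizes above 20,000 points, at most 49 points ever take the slow path, which is why the performance impact is negligible.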
Interpolated Element Calculation
This function interpolates three quantities between 4 neighbouring pixels: the horizontal
gradient, the vertical gradient and the pixel intensity. The aim is to give sub-pixel accuracy to the
tracking process. The interpolation uses the floating-point coordinates (u, v) corresponding
to the image plane. The integer part of these is used to calculate the memory offset for the
pattern in Fig. 3.10, and then the fractional part is used to weigh the interpolation
between the 4 pixels.
This unit performs four sets of reads from the pixel cache, stored in two buffers implemented
with registers that store different line segments referred to in Fig. 3.11 as line buffers, for a total
of twelve pixel intensity values. These twelve pixels are used for the interpolation process. There
are four centre pixels whose intensity is interpolated as was demonstrated above in Fig. 3.6 and
for which we calculate four sets of the two gradients dx and dy. For each of the four sets of
gradients we need access to 4 immediate neighbours, hence a total of 16 pixels, which overlap
as shown in Fig. 3.10. The top left pixel used for the intensity interpolation is the one with
coordinates (floor(u), floor(v)), shown in dark grey in the figure.
Figure 3.10: Pixel Gradient and Intensity Interpolation
The gradients for a pixel with coordinates (u, v) are calculated as below and interpolated linearly
in the same way as the intensity values:
dx = (1/2) · (intensity(u + 1, v) − intensity(u − 1, v))
dy = (1/2) · (intensity(u, v + 1) − intensity(u, v − 1))
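Putting the interpolation pattern and the gradient formulas together, a software model of this computation might look as follows. This is a plain-Python sketch of the functionality, not the hardware description; bounds handling is simplified:

```python
import math

def interp_element(img, u, v):
    """Bilinearly interpolate intensity and central-difference
    gradients at sub-pixel (u, v). `img[y][x]` holds intensities;
    the caller must keep (u, v) away from the image border."""
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0          # fractional interpolation weights

    def grad(x, y):
        # central differences, as in the equations above
        gx = 0.5 * (img[y][x + 1] - img[y][x - 1])
        gy = 0.5 * (img[y + 1][x] - img[y - 1][x])
        return img[y][x], gx, gy

    acc = [0.0, 0.0, 0.0]            # intensity C, dx, dy
    for dyi, wy in ((0, 1 - fy), (1, fy)):
        for dxi, wx in ((0, 1 - fx), (1, fx)):
            c, gx, gy = grad(x0 + dxi, y0 + dyi)
            w = wx * wy
            acc[0] += w * c
            acc[1] += w * gx
            acc[2] += w * gy
    return tuple(acc)
```

Because each of the four interpolated positions needs its own 4-pixel gradient neighbourhood, the hardware fetches the overlapping 12-pixel region of Fig. 3.10 rather than 16 separate pixels.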
A high-level view of the hardware unit implementing this functionality is shown in Fig. 3.11.
In the figure, the Fetch Pixel Region hardware generates addresses and accesses the on-chip
frame cache to retrieve the 4 line segments comprising the 12 values shown in Fig 3.10. Then
two values are read each cycle from the buffers, and are fed into a series of multipliers and then
add/sub units to produce the final interpolation results at a rate of one set per 5 cycles. The
thicker lines in the figure indicate the transfer of multiple values per cycle, while thinner lines
are wire connections. The hardware units inside the rounded rectangles were time-shared across
many different operations, with the state machine at this part switching multiple multiplexers
and control signals to manage this. In this figure some of these connections and multiplexers
are omitted for clarity of presentation.
The operations performed are 6 float subtractions, 12 float multiplications, and 11 float addi-
tions in total. The overall initiation interval was used to guide resource allocation in this unit
as well. The result of these calculations is a set of three elements, dx, dy, and an interpolated
Figure 3.11: Element Interpolation
intensity C, fed back into the rest of the pipeline.
3.4.3 Linear System Generation - Jacobian Update Unit
This second accelerator, originally called the Jacobian Update Unit in the published work,
calculates the derivative values for every single reference pixel and uses them, as in the system
of equations below, to calculate the Jacobian as well as the Hessian approximation H ≈ JTJ .
It then integrates these in a linear system, the solution of which will provide the next update
step for the tracking task.
The accelerator reads back the last version of the buffers containing the results of the
residual and weight calculation (produced by the first accelerator) when the linear system's solution
is accepted as one that reduced the error. These contain the x’, y’, z’, dx, dy, weight
and residual values for each reference pixel that was tracked. These are used to calculate the
6 values of a Jacobian that reduces to a row vector, since our function has a scalar output:
f : R6 → R. The 6 elements of the Jacobian are calculated as follows:
J(1) = dx / z
J(2) = dy / z
J(3) = −(x/z²)·dx − (y/z²)·dy
J(4) = −(xy/z²)·dx − ((z² + y²)/z²)·dy
J(5) = ((z² + x²)/z²)·dx + (xy/z²)·dy
J(6) = −(y/z)·dx + (x/z)·dy
The Hessian is then approximated as the product of the Jacobian with its own transpose,
scaled by the weight factor. The linear system is generated by accumulating these results for each
reference point i, with:
A = A + Jᵀ(i) · J(i) · w
b = b + Jᵀ(i) · residual(i) · w
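The two steps above, the Jacobian row and the accumulation of the normal equations, can be modelled in plain Python as follows (a software sketch of the unit, not its hardware description; the sign convention for the final update step is left to the solver, as in the software implementation):

```python
def jacobian_row(x, y, z, dx, dy):
    """Row Jacobian J(1)..J(6) of the photometric residual with
    respect to the 6-DoF pose update, matching the equations above."""
    z2 = z * z
    return [
        dx / z,
        dy / z,
        -(x / z2) * dx - (y / z2) * dy,
        -(x * y / z2) * dx - ((z2 + y * y) / z2) * dy,
        ((z2 + x * x) / z2) * dx + (x * y / z2) * dy,
        -(y / z) * dx + (x / z) * dy,
    ]

def accumulate_system(points):
    """Accumulate A += w * J^T J and b += w * J^T r over all tracked
    points; solving the resulting 6x6 linear system gives the next
    pose update step."""
    A = [[0.0] * 6 for _ in range(6)]
    b = [0.0] * 6
    for (x, y, z, dx, dy, r, w) in points:
        J = jacobian_row(x, y, z, dx, dy)
        for i in range(6):
            b[i] += w * J[i] * r
            for j in range(6):
                A[i][j] += w * J[i] * J[j]
    return A, b
```

Each point contributes a rank-one update to the symmetric 6x6 matrix A, which is why the hardware can stream points through and keep only 21 unique accumulators plus the 6-element vector b.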
In total, this accelerator contains twelve floating-point multipliers, one divider, three add/sub
units, and three dedicated floating-point adders. The initiation interval was set to seven hardware
cycles, matching the limit of reading a set of seven inputs per iteration from a single
port that can only schedule one word read per cycle. This also allowed a simpler accumulation for
the Jacobian vector, which introduces an inter-cycle dependency. Again, the allocation of units,
scheduling and time-sharing were chosen by the tool, guided by pre-processor directives,
chiefly those setting the targeted initiation interval.
3.5 Evaluation
The SLAM system was implemented and tested on an Avnet ZedBoard carrying a Xilinx Zynq-7020
SoC. This includes a dual-core ARM Cortex-A9 processor running at 667 MHz, a Xilinx
FPGA fabric on the same chip, and 512 MB of off-chip DDR3 RAM. The programmable
logic contains a total of 85K logic cells, 4.9 Mbits of BRAM and 220 DSP blocks.
For evaluation, the design was synthesized and placed-and-routed with Vivado HLS and the Vivado
Design Suite (v2015.4), targeting the development board, on which it was eventually run and
tested. The ARM Cortex-A9 runs an Ubuntu-based Linux OS which supports the
software, as well as custom drivers to interface with the hardware accelerators. The bootloader,
OS and all relevant file systems were placed on and run from an SD card.
The first step for the evaluation was to port the LSD-SLAM algorithm to that system and
resolve library compatibilities, often having to compile libraries for the ARM core separately.
Next, it was run as pure software to get timing information for different tasks and functions.
These were compared with the profiling results to confirm our conclusions and guide further
the accelerator design. Then, a user-space driver was written to interface with the FPGA
accelerators, at which point both the software and the hardware version were run in parallel in
the same system to check their behaviour. SLAM is inherently a family of algorithms dealing
with probabilistic estimates based on imprecise data. As such, in this work we focused on
achieving results equivalent to the open-source software implementation, which was proven to
work as expected on real-world datasets both in the published works [17] and on our own
machines for verification. After verifying that the hardware accelerator's output was the same
as the output of the software functions, the full SLAM algorithm was run with the hardware-based
version of tracking.
The FPGA was run at a clock frequency of 100 MHz, to meet the timing achieved by the
tools. At that frequency the system achieved on average a total frame processing time, in-
cluding the cost of memory copy and some bookkeeping on the software side, of 218ms per
frame. This corresponds to a performance of on average 4.55 frames per second for our main
test scene, tracking at a resolution of 320x240. This resolution is typical for state-of-the-art
direct SLAM, and as mentioned in Engel’s work [17], tracking stops at one pyramid level lower
than the camera frame’s original resolution, while mapping uses the full resolution to obtain
depth measurements. For the exact same test sequence the software-only version, without the
NEON extensions, running on the dual-core ARM A9 achieved 2.27 frames per second while
the use of the hand-coded assembly routines raised that to 2.6 frames per second. The tracking
performance, frame-by-frame of both the software and hardware versions can vary with the
amount of map points present in the scene.
More complex scenes require more operations and both the hardware and software versions vary
in running time approximately linearly with the number of points. The average performance
results are given in Figure 3.13, where this difference becomes evident. In the figure, the lower
bars represent a particularly complex sequence with a higher ratio of map points to total pixels.
In the centre of the figure, the vectorised version with inline NEON assembly is included as a
fair comparison to the best achievable performance with this general purpose CPU platform.
Examples of two scenes that would generate maps with a different number of points are shown in
Fig. 3.12. The scene on the right contains fewer mappable high-texture areas and as a
result requires less computation time.
Figure 3.12: Different scenes will generate keyframes with fewer or more mapped points, which affects the runtime both in hardware and in software
The FPGA accelerated system was consistently 2× faster for a variety of scenes compared to
the pure software version. The benchmarks used are included in Table 3.4, and were provided
by TUM [75] as example sequences for LSD-SLAM. This is a promising result, and it is
constrained only by memory bandwidth, not by computation performance. This performance
was achieved with an estimated dynamic power consumption of 0.71 Watts for the FPGA, and a
chip total of approximately 2.25W. The power consumption of the software-only version without
the vector instructions was estimated at 1.54 Watts for 50% average core load, as a conservative
estimate. Even with these figures, the FPGA can perform the computationally intensive task
of tracking at an estimated 6.40 frames/s/Watt, in comparison to just 1.48 frames/s/Watt for
the CPU. The power figures for this chapter were based entirely on estimates given by the
Xilinx Vivado tools. In Chapter 5, power monitoring was added to the development board
setup to compare different platforms with greater accuracy.
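The performance-per-watt figures quoted above follow directly from the measured frame rates and the power estimates:

```python
# figures from the evaluation in this section
fpga_fps, fpga_watts = 4.55, 0.71   # accelerated tracking, FPGA dynamic power
cpu_fps, cpu_watts = 2.27, 1.54     # software-only on the ARM A9

fpga_eff = fpga_fps / fpga_watts    # ~6.4 frames/s/Watt
cpu_eff = cpu_fps / cpu_watts       # ~1.47 frames/s/Watt
print(f"FPGA: {fpga_eff:.2f} f/s/W, CPU: {cpu_eff:.2f} f/s/W, "
      f"ratio: {fpga_eff / cpu_eff:.1f}x")
```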
[Chart data — first sequence: ARM A9 2.27 fps, ARM A9 with NEON 2.6 fps, hardware accelerated 4.55 fps; complex sequence: 1.9, 2.2 and 3.84 fps respectively.]
Figure 3.13: Average performance in frames per second for two different sequences selected from the datasets in Table 3.4. The grey colour corresponds to a more dense, complex scene where a higher number of points are mapped and used for tracking. Examples of two scenes that would generate maps with a different number of points are in Fig. 3.12.
For the FPGA used in this chapter, the utilized resources were 54% of the DSPs (119 DSPs)
and 44.7% of the flip-flops (47.5K), as shown in Table 3.5. However, the current design
is not well utilized: a rough estimate of the available performance, based on the multipliers and
adders generated, is approximately an order of magnitude higher. It turned
out that this architecture could not cope with the memory traffic and the memory system's
latency. Additionally, the design of a deep pipeline to combine all the necessary computation
Dataset            Duration   Resolution         Description
Desk Sequence      0:55 min   640x480 @ 50 fps   Indoors, small cluttered space
Machine Sequence   2:20 min   640x480 @ 50 fps   Outdoors, combination of near and far-away objects
Table 3.4: Datasets used, provided on TUM’s website by the authors of LSD-SLAM
Estimated by Tool   LS. Update (J)   Res. & Weight Unit   Post-Implementation
DSPs                48               78                   119
BRAM                0                73                   73
FFs                 14630            32328                47570
LUTs                24902            41383                40298
Table 3.5: FPGA resources. The first two columns represent the resources post-synthesis, where the tool has allocated a certain number of resources to each instantiated hardware unit. Post-implementation, Vivado uses various optimisations to reduce usage by combining circuits or simplifying various units.
was expected to improve utilization by allowing the automatic scheduling algorithms of high-level
synthesis to time-share hardware units more effectively. Instead, possibly due to the
immaturity of the tools but also due to the nature of high-level synthesis frameworks, the
tool could not schedule many operations effectively in this fashion, and instead generated more
overhead as the pipeline grew longer and more complex.
As we will discuss in the following chapter, designing smaller hardware units that are connected
to each other through buffers, where crucially the tool cannot determine the input and
output schedule but treats them as operations with a latency of one, turned out to be more effective,
as it allowed us to manually tune particularly critical pipeline paths for better latency and
performance characteristics. The most significant redesign, however, and the one that produced a
performance breakthrough in Chapter 4, was the conversion of the design philosophy to a
dataflow paradigm, which suits this type of application much better. We will discuss this in
further detail in that chapter.
[Chart data — memory transfer cost per pyramid level: Level 4: 520 μs, Level 3: 1221.11 μs, Level 2: 4602.69 μs, Level 1: 38939.1 μs.]
Figure 3.14: Memory transfer cost in microseconds to synchronise the map and frame with the hardware buffers on the DRAM. This happens once per pyramid level.
As it stands, this design achieves more than 4.5 frames per second, including some time spent
on data movement. The measurements for the cost of data copies are shown for each level in
Figure 3.14. Interfacing custom hardware with software that uses large dynamically
allocated buffers brings a penalty due to the cost of synchronising a large amount of data.
These buffers are also stored in virtual pages, scattered across the physical
system memory. It turned out that copying the input data to a dedicated area of the RAM
takes a significant percentage of the total execution time, up to 25% at this performance level.
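Amdahl's law makes the cost of this concrete: if data synchronisation takes a fixed 25% of the execution time, no amount of acceleration of the remaining computation can exceed a 4x overall speedup.

```python
def amdahl_speedup(serial_fraction: float, accel_speedup: float) -> float:
    """Overall speedup when only (1 - serial_fraction) of the runtime
    benefits from the accelerator."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / accel_speedup)

# with 25% of the time spent on memory copies, even extreme acceleration
# of the computation approaches but never exceeds 1 / 0.25 = 4x overall
print(amdahl_speedup(0.25, 10))   # ~3.08
print(amdahl_speedup(0.25, 1e9))  # ~4.0, the asymptote
```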
The per-function performance of the hardware accelerator scales linearly based on the number
of points, with a small extra penalty on the final level due to some additional reads and writes
necessary to support a feature of the original software implementation, namely a reported pixel
quality metric. This is shown in Figure 3.15. In that figure, Residual and Weights refers to
the calculation of the photometric residual and the corresponding weights in the first hardware
accelerator, two different functions in software, combined here in a sum for easier comparison.
Linear System Update refers to calling the Jacobian update unit to calculate the Jacobian and
[Chart data — per-function timing in μs across pyramid levels (Level 4 to Level 1) for: Residual and Weights (FPGA), Residual and Weights (Software), Linear System Update (FPGA), Linear System Update (Software).]
Figure 3.15: Software and hardware timing for individual functions. These are run multiple times until the error stops decreasing significantly, at which point the process is repeated for coarse-to-fine pyramid levels until the penultimate level (once subsampled from the original image) is reached
Hessian, and then solving the linear system. It is important to note that the Residual and
Weight Calculation unit is called more often than the Jacobian Update unit, while the memory
copy mentioned in the previous paragraph happens only once per pyramid level.
The pipeline for the residual and weight calculation has an interval of 6 cycles, which translates
into a throughput of approximately 16.6 million points per second. However, to achieve
that throughput, the input data rate would have to be about 300 MB/s and, crucially,
the output rate would have to be more than 420 MB/s. After several hardware architectures
were designed and tested on the development board, a good design point turned out to be
performing one larger burst transfer for 50 reference points (a total of 250 words or 1000
bytes), performing the computation on them while caching the results locally to allow the pipeline
to run as fast as possible, and then writing the results back to the DDR in a series of seven
successive burst writes. The accelerator updating the linear system has a similar bottleneck
from memory performance and latency. The technical reference manual indicates that, ideally,
the high-performance DMA port on the SoC can sustain bursts of 255 words, transferring
one word per cycle at a maximum frequency of 150 MHz, or 600 MB/s sustained; supplying a
steady stream of data would therefore mean approaching the theoretical limits of the
port. This shows, again, the high communication demands characteristic of SLAM
algorithms. The conclusion is that a complete redesign of the way the data is stored
and accessed is a necessary hardware/software co-design step to allow any hardware accelerator
to function efficiently. Otherwise, the cost of data synchronisation will limit performance gains
according to Amdahl's law.
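Assuming 4-byte words, the bandwidth requirements behind these figures can be reproduced with a little arithmetic; the slightly lower figures quoted in the text presumably reflect rounding or overhead assumptions made there.

```python
clock_hz = 100e6
interval_cycles = 6                          # pipeline initiation interval
points_per_sec = clock_hz / interval_cycles  # ~16.7 M points/s peak

bytes_in = 5 * 4                  # 5-word map point per input
bytes_out = 7 * 4                 # 7 result words written back per point
in_bw = points_per_sec * bytes_in    # ~333 MB/s required input rate
out_bw = points_per_sec * bytes_out  # ~467 MB/s required output rate
hp_port_limit = 150e6 * 4            # 600 MB/s: one word/cycle at 150 MHz
print(in_bw / 1e6, out_bw / 1e6, hp_port_limit / 1e6)
```

Even at its theoretical peak, a single HP port barely covers the output stream alone, which quantifies why the redesign in Chapter 4 treats memory throughput as the first-class design constraint.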
The latency and acceleration results discussed in this section demonstrate that this hardware
design approximately doubled the performance and gave an improvement of more than 4× the
performance-per-watt compared to a low-power embedded CPU. This doubling in performance
means that, at a power budget smaller than that of a mobile CPU, the
algorithm can follow movement that is twice as fast for the robot or platform with the same
tracking accuracy as before. Alternatively, it can provide more accurate tracking estimates, and
as a result a more accurate reconstruction, at the same platform velocity compared to the
software version. Figures 3.16 and 3.17 demonstrate the impact of running through the same dataset
while having to drop frames due to a 2–4× reduction in performance. A
minimum level of performance for a certain platform is necessary to avoid very large errors and
potential failure in tracking due to fast movement. However, further improvements past that
minimum are important as they can provide a larger improvement in quality and accuracy for
the same algorithm. For example in Fig. 3.17 a 2× reduction in performance leads to visible
deformation of recovered map surfaces, reduced detail on some structures and a significantly
larger accumulated trajectory error on the map at the left of the figure.
3.6 Conclusion
Custom hardware on an off-the-shelf FPGA is capable of bringing advanced and more dense
SLAM algorithms to embedded low-power devices. However, different bottlenecks in the pro-
posed architecture prevented the platform from reaching its potential. A hardware/software
[Figure panels: "Skipping frames due to performance constraints" (left) and "Adequate performance for the movement speed" (right).]
Figure 3.16: In a moving platform or camera, the performance of SLAM will directly affect the accuracy of tracking and the accuracy and quality of the reconstruction, and in extreme cases lead to tracking loss. The map on the right is from a system that can deliver a 4× improvement in performance compared to the left. The map on the left has accumulated a very large error from skipping frames due to performance constraints
[Figure panels: "Higher performance leads to more information processed" and "Reduced drift and error accumulation for pose and map".]
Figure 3.17: Even if the slower system can keep up, improved performance is crucial to improve quality and accuracy under fast movement. The map on the left was recovered on a system with a performance deficit of 2× compared to the one on the right
co-design step is necessary, since CPU-optimised algorithms and implementations are not
well suited to an accelerator. A hardware design should place great importance on memory
architecture, including caching techniques and data movement that overlaps memory
accesses with computation, to ensure a robust, high level of performance. With these communication
requirements in mind, as well as the nature of FPGAs, which operate at a lower frequency
but can perform a large number of operations simultaneously in every cycle, a streaming
interface is ideal, where a partial redesign of state-of-the-art algorithms would create sequential
data movement from one point in the pipeline to another. If a redesign of the front end ensures
that data passes through the FPGA at a high sustained rate, even a small FPGA-SoC such as
the one used in this work can achieve real-time performance for the task of tracking within a
small power budget of less than 3 Watts.
The following chapter discusses exactly such an architecture, designed from the ground-up with
many different domain specific optimizations around a dataflow design. As we will discuss in
Chapter 4 it achieved a sub-10ms latency for all the iterations comprising one tracking step.
Chapter 4
Accelerating Tracking for SLAM
The previous chapter presented an investigation of semi-dense SLAM,
focused on LSD-SLAM, on modern high-end desktop machines and embedded
platforms. The computation patterns and memory and performance
requirements were discussed, and profiling and timing results were used to guide
the acceleration of key methods. A straightforward implementation of an
accelerator on an FPGA-SoC showed the potential performance that could
be achieved by these platforms, but revealed bottlenecks and limitations that
reduced the peak achievable performance. In this chapter, we present a high-performance
design for a custom accelerator targeting the task of tracking
in semi-dense SLAM. Taking advantage of the lessons learned in Chapter 3,
the design presented here focuses on providing an efficient, high-performance
solution for direct tracking using a high-bandwidth streaming architecture,
optimised for maximum memory throughput. At its centre is a Tracking
Core that accelerates the computation involved in the non-linear least-squares
optimization for direct whole-image alignment, at a much higher performance
per watt compared to general-purpose hardware.
The architecture is designed to scale with the available resources in order
to enable its use for different performance/cost levels and platforms.
When evaluated, it achieves real-time performance on an embedded low-power
platform, without any compromise in the quality of the result.
This chapter is based on a conference paper co-authored by Christos-Savvas
Bouganis [Boikos Konstantinos et al., FPL, IEEE 2017 [19]].
4.1 Motivation
SLAM is composed of two interdependent, strongly coupled tasks: tracking and mapping.
In an unknown environment, the reliable output of the first is needed to start generating the
second, and vice versa. The task of tracking is the first component to address, since it has the
strictest latency requirements for processing and dropped frames. A low framerate and high
latency for tracking will result in an autonomous robot accumulating errors during tracking and
mapping, and hence much slower and less robust operation, with a strong drop in performance
potentially resulting in a tracking failure where the pose cannot be recovered at all. In addition,
profiling results, presented in Chapter 3, showed that together with mapping, tracking accounts
for the majority of the computational requirements of real-time visual SLAM.
As discussed in the previous chapter, embedded-grade CPUs, with power and weight characteristics
that fit emerging applications in robotics, IoT and augmented reality, lack the
performance required to process semi-dense SLAM in real time. At the same time, this field is still
characterized by significant volatility, with algorithms that change significantly every few years. The
use of FPGAs allows efficient heterogeneous architectures to be explored and used in the
field, while reducing the cost of changes needed to follow the state-of-the-art in the software. Custom
hardware accelerators mapped onto an FPGA-SoC, acting as co-processors to a mobile CPU,
can achieve both the high performance and low latency requirements of these applications while
at the same time significantly increasing power efficiency for this task, making them a good
candidate towards overcoming this challenge.
In this context, this chapter presents a custom hardware architecture with a combination of
dataflow processing and high-bandwidth memory interfaces that provides a high frame-rate, low
latency solution for direct photometric tracking for SLAM using a monocular camera. For this
chapter, mapping is performed in software based on the work by J. Engel et al. [17]. To the best
of our knowledge, this is the first high performance hardware architecture for direct photometric
tracking for SLAM on an FPGA-SoC. A high level of sustained performance is targeted, to
address the latency requirements of the application together with low-power consumption, to
enable the embedding of this type of application in low-power, lightweight robotic platforms
such as quadcopters, fixed-wing drones and household robotics as well as for augmented reality
applications.
The key contributions presented in this chapter are the following:
• To the best of our knowledge, the first design for a high performance power efficient
accelerator architecture for direct photometric tracking for SLAM on an FPGA-SoC.
• An investigation of the factors that affect the performance of this type of accelerator, the
peak performance that can be achieved by an off-the-shelf FPGA-SoC, and the bottlenecks
that can arise when it is used as a coprocessor next to a mobile CPU for this application.
4.2 Architecture
4.2.1 System Architecture
The system architecture is designed with an FPGA SoC in mind, a System-on-Chip that con-
tains a mobile CPU and an FPGA fabric tightly integrated. The CPU and FPGA share the
same off-chip DRAM, and they operate on the same memory space. Throughout this chapter,
the part of the accelerator that performs the computation on the input data will be discussed
as the “Tracking Core”. The inputs of this core are firstly a current camera frame, accessed
in a random fashion during operation, the current recovered depth map, accessed sequentially,
and some input parameters for the algorithm such as the current pose estimate and the di-
mensions of the frame being tracked at the current pyramid level. As explained in Section 2.5,
in Chapter 2, tracking is performed multiple times for different resolutions from coarse to fine
to improve convergence performance, so the current input frame has a subsampled version for
each different pyramid level. Its output is a linear system, the solution of which provides the next
optimisation step, as will be discussed in the next section.
To support these accesses on a system level, the following interfaces were used. The Tracking
Core on the reconfigurable fabric is connected directly to the off-chip memory controller through
two dedicated high performance ports (HP ports), where the accelerator acts as a master,
providing a high throughput and low latency stream of reads from the DRAM in a similar
fashion to the system of Chapter 3. This type of interface is crucial for the high bandwidth
requirements of this type of application. The custom hardware is also connected to a general
purpose (GP) port, essentially a peripheral port to the ARM CPU. Communicating through
the GP port, we enable high-level control and parameter set-up from the software running on
the CPU, where the CPU acts as a master. An important differentiator at the system level from
the work presented in the previous chapter is the increased efficiency in communicating through
the HP ports. In the accelerator presented in this chapter, dedicated input units, decoupled
from the processing pipeline, provide high-bandwidth streaming access to the off-chip memory
at the full bus width and contain large buffers to improve the memory access performance and
efficiency.
An important aspect of this architecture is the movement of data inside the core, as well as
to and from memory. As this architecture is not meant to be device or resolution specific, the
memory interfaces are designed to surround a computation core with high-bandwidth streaming
access that can alleviate the bottlenecks on communication and future-proof the system so that
the hardware’s utilization can be close to 100%. Thus, performance is dependent mainly upon
the capabilities of the computation core and can scale for different FPGAs or a fully-custom
design, avoiding the pitfalls and bottlenecks discussed in Chapter 3. As we can see in Fig. 4.1,
the communication with the off-chip DRAM is routed through two dedicated hardware ports.
Data reads are handled by two input blocks each with its own FIFO buffers, to allow the input
system to absorb memory inefficiencies and delays. Having dedicated ports means that no
CPU time is consumed to route traffic to and from DRAM. Additionally, going directly to the
memory controller from a dedicated port means that the custom hardware avoids contention
with most of the memory traffic generated by software running on the ARM CPU, and this
shorter path has a much lower latency to initiate a transaction.
There is also an output block in the core, with its own master controller, to handle writing
the output back to DRAM after the core has finished processing all elements. This block shares
one of the ports with the input units above, since the ports can be operated in duplex mode.
The writes happen after all points are processed and are orders of magnitude fewer than the
reads. Hence, the read bandwidth is not affected by this topology. All the ports are designed
to execute transactions using the full bus width in burst transactions, with custom units in
the FPGA unpacking data on the fly at the input and output to translate to and from the
original word width. One of the biggest bottlenecks in the work presented in Chapter 3 was the
latency from initiating to completing a memory transaction. By queueing long burst requests
of hundreds of words, buffering the resulting reads and overlapping all I/O operations with
computation, the presented architecture results in a core that can leverage very high read
bandwidth from the memory system, in order to be able to support even the most memory
demanding algorithms.
Throughout this core, the communication between hardware blocks is done through FIFOs,
as seen in Fig. 4.1 and Fig. 4.2, following the dataflow paradigm. This enables the different
hardware blocks to function independently from each other, beginning processing as soon as a
data point becomes available and blocking if any stream becomes empty. This decouples the
internal tracking pipeline from the input blocks, and by configuring the input blocks at a faster
rate than the rate of consumption further down the pipeline, the gaps that can arise between
memory requests due to memory traffic and the latency of the memory system can be hidden.
This way, the utilization of the hardware resources of the pipeline is significantly improved since
they spend less time waiting for data to arrive from memory. They are designed to operate at
a slower rate, using fewer resources, but with a high utilization (100%).
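This decoupling can be illustrated with a minimal software analogue, in which streams between blocks are modelled as FIFOs and a stage runs purely on data availability. The names, the integer payload and the `Fifo` type are illustrative only, not the actual hardware interfaces:

```cpp
#include <queue>

// Minimal software model of stream-connected blocks: each stage consumes
// from its input FIFO as soon as data is available and stops when the
// stream is empty, mirroring the Tracking Core's dataflow paradigm.
template <typename T>
struct Fifo {
    std::queue<T> q;
    bool empty() const { return q.empty(); }
    void write(const T& v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
};

// A stage with no explicit control signals: it processes whatever data
// is present on its input stream and forwards results downstream.
void scale_stage(Fifo<int>& in, Fifo<int>& out) {
    while (!in.empty())
        out.write(2 * in.read());
}
```

In hardware, each such stage is a free-running block and the FIFOs provide the back-pressure; the software model only captures the data-driven start/stop behaviour.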
Figure 4.1: System Architecture
4.2.2 Direct Tracking Core
The tracking core is designed as a set of hardware blocks that operate independently in a
dataflow paradigm, with matching processing rates, connected by sets of FIFO buffers. Instead
of explicit control signals most units start or pause processing depending on the availability and
type of data. Further taking advantage of this, functionality is split into smaller self-contained
units that are easier to design and optimise using a high-level synthesis language. The hardware
is also designed to scale in terms of resources. By changing the target processing rate of the
pipeline’s components, and with minimal modifications, the proposed architecture can target
a larger FPGA or be scaled down.
Different parts of the pipeline can utilize a larger number of arithmetic units to schedule more
operations per cycle, or re-use fewer units to trade off processing throughput with resource
use. Especially in light of the tools used for design space exploration in this work, such an
architecture improves exploration of more efficient or high-performance design points by taking
advantage of the strengths of high-level synthesis and the strengths of an experienced designer.
High-level synthesis offers benefits such as automatic scheduling and hardware unit placement
which can give a quick solution, optimising for the lowest latency for a given set of operations.
Separating the functionality into smaller, self-contained units naturally leads to simpler and
more efficient designs, while making the problem size for each individual unit manageable
enough for the designer to identify better opportunities and override the automatic placement
with a hand-picked solution where necessary.
The inputs for the core, streamed from the main memory, are in the form of a five-element
vector for each Keypoint, with the reads split into sets of three and two elements for the first
and second input unit respectively. Each Keypoint is a point of the recovered depth map that
has a valid depth estimate as described in Chapter 2. These five elements are the estimated
coordinates relative to the Keyframe of a Keypoint (since only the valid Keypoints of the map
are read by this accelerator), the depth estimate, the depth variance and the intensity of that
point as observed by the camera. Additionally, for each Keypoint, a 4x4 window is needed
from the currently tracked frame. The entire current camera frame is preloaded on the FPGA
once for every pyramid (resolution) level, in a partitioned memory described in Section 4.2.3, to
avoid the off-chip memory latency which for successive random accesses would quickly become a
very significant bottleneck. In this work it was demonstrated that performing some redundant
computation, in order to significantly reduce memory traffic, leads to higher performance but
can also improve power characteristics owing to the larger energy cost of transferring data
off-chip compared to performing processing on the fly.
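The per-Keypoint reprojection that drives these window accesses can be sketched as plain software. This is a sketch under the standard pinhole model of Chapter 2, not the hardware pipeline itself; the `Pose` and `Intrinsics` structures and the function name are assumptions for illustration:

```cpp
#include <array>

// Illustrative pinhole reprojection of one Keypoint: its (x, y, z) position
// relative to the Keyframe is transformed by the current pose estimate (R, t)
// and projected to floating-point image coordinates (u, v).
struct Intrinsics { float fx, fy, cx, cy; };
struct Pose {
    std::array<std::array<float, 3>, 3> R;  // rotation, row-major
    std::array<float, 3> t;                 // translation
};

// Returns false (signalling a skip further down the pipeline) for points
// that land behind the camera.
bool reproject(const Pose& T, const Intrinsics& K,
               float x, float y, float z, float& u, float& v) {
    float xc = T.R[0][0]*x + T.R[0][1]*y + T.R[0][2]*z + T.t[0];
    float yc = T.R[1][0]*x + T.R[1][1]*y + T.R[1][2]*z + T.t[1];
    float zc = T.R[2][0]*x + T.R[2][1]*y + T.R[2][2]*z + T.t[2];
    if (zc <= 0.0f) return false;
    u = K.fx * (xc / zc) + K.cx;   // floating-point image coordinates
    v = K.fy * (yc / zc) + K.cy;
    return true;
}
```

The resulting (u, v) pair is what selects the 4x4 window in the partitioned frame cache.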
Algorithm 6 presents a high-level pseudocode representation of the tracking task as it was
implemented in hardware. With the exception of some high-level decisions and the solution
of the final linear system that were allocated to the mobile CPU, most of the functionality of
the pose estimation is implemented in the reconfigurable hardware. The inputs are the current
camera frame at the resolution of the current pyramid level, an initial pose estimate and the
Keypoints from the recovered map, while the output is the current error for this pose estimate
and the Gauss-Newton linear system to solve for the next optimisation step. The accelerator
pre-computes the linear system update and streams it out to the external memory in case it
will be required in the next step of the algorithm.
Algorithm 6 Tracking, hardware accelerator - Calculate next optimisation step
procedure Tracking Accelerator(new Frame, tracking reference, pose estimate, K)
    # Happens for every new camera frame and pyramid level
    Update Hardware Frame Cache
    if New Camera Frame Captured then
        Update Current Camera Frame
    for all Keypoints in reference do        # Keypoints with a valid observation
        Stream-in(reference[Keypoint])       # Implemented as deep streaming pipeline
        Reproject Keypoint to new Frame(reference[Keypoint])  # Use current pose estimate
        if Coordinates are valid then
            Calculate interpolated intensity I and gradients dx, dy
        else
            Signal next pipeline units to skip
        Calculate Intensity residuals
        Calculate squared values and sums
        # Interleave accumulated values in multiple register banks
        Calculate fitness score
        residual = residual * fitness
        if fitness score < fitness threshold then
            Signal next pipeline units to skip
        # Directly calculate weights and error
        variance = reference[Keypoint].depth variance   # Cached
        Calculate weight(variance, residual)
        Add pixel noise estimate to weight(weight)
        weighted error = Huber(error * weight)          # Huber Norm
        summed error = summed error + weighted error * weighted error
        # Pre-compute Linear System
        Ji = Calculate Elements of Jacobian vector      # From registered results
        # Interleave accumulated values in multiple register banks, improving
        # pipeline performance. Re-combine on exit
        A = A + Ji * Ji^T * weight
        b = b - Ji * residual * weight
        error = error + residual * residual * weight
    return errors, num successful points, A, b
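The robust weighting step in the loop above can be sketched in software. The exact constants and weighting formula of the implemented algorithm are not reproduced here; the Huber-style down-weighting below is an illustrative stand-in:

```cpp
#include <cmath>

// Illustrative Huber-style weight: residuals below `delta` keep full weight,
// larger residuals are down-weighted, limiting the influence of outliers on
// the accumulated error and on the linear system.
float huber_weight(float residual, float delta) {
    float a = std::fabs(residual);
    return (a <= delta) ? 1.0f : delta / a;
}
```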
Image Frame Projection
This hardware block is responsible for projecting a stream of reference points with coordinates
(x,y,z) from the current Keyframe to the plane of the camera frame whose pose is currently
being estimated. The projection is done based on the current estimate of the frame’s pose using
the pinhole camera model and the principles of multi-view geometry, described in more detail in
Chapter 2. The block finally generates a pair of floating-point image frame coordinates (x’,y’)
that are streamed to the Sub-pixel Element Interpolation block. The most significant differences
in this block compared to the design in Chapter 3 are the stream interface, the dataflow paradigm
and the optimisation of the calculation patterns to allow deeper pipelining and a more aggressive
schedule for a significantly higher throughput.
The first of these optimisations has to do with accumulating different sums, a significant part
of the tracking algorithm. For every accumulation there is a 5-7 cycle latency for the adder
to produce a final result. To speed-up loops containing accumulations a template hardware
accumulator was designed that would cycle the outputs and inputs depending on the desired
throughput to different registers, creating in effect partial sums. At the end of the loop the
partial sum registers are added in a final step to generate the full accumulation sum. Since the
cost of that final step is in the tens of cycles and the main loop runs for tens of thousands the
added latency is insignificant but the throughput bottlenecks are lifted. Other optimisations
that are discussed in the next subsections include introducing fixed point for sub-pixel opera-
tions in the place of floating-point and lifting various pipeline bottlenecks to achieve a higher
overlapping of loop iterations and as a result higher throughput.
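A software model of this template accumulator is sketched below: partial sums cycle through a bank of registers sized to cover the adder latency, and a short final pass combines them. The constant `BANKS` and the function name are illustrative choices for this sketch:

```cpp
#include <cstddef>

// Partial sums cycle through BANKS registers so a new addition can be issued
// every cycle despite the 5-7 cycle latency of the floating-point adder.
// BANKS would be chosen to cover the adder pipeline depth.
constexpr std::size_t BANKS = 8;

float accumulate(const float* data, std::size_t n) {
    float partial[BANKS] = {0.0f};
    for (std::size_t i = 0; i < n; ++i)
        partial[i % BANKS] += data[i];  // consecutive inputs hit different banks
    float sum = 0.0f;
    for (std::size_t b = 0; b < BANKS; ++b)
        sum += partial[b];  // final combine: tens of cycles, negligible overall
    return sum;
}
```

As the text notes, the final combine costs tens of cycles against a main loop of tens of thousands of iterations, so the added latency is insignificant while the per-cycle throughput bottleneck is removed.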
Sub-pixel Element Interpolation
This block receives the stream of floating-point coordinates (x’,y’), belonging to the image
plane of the current frame, generated by the Image Frame Projection block. Its functionality,
similar to the interpolated element calculation in the previous chapter, is first to find the block
of 4 pixels surrounding that planar point and calculate their intensity gradients dx and dy. It
Figure 4.2: Tracking Core Architecture
then proceeds to calculate an interpolated value for the intensity and the gradients of the point
(x’,y’), using the values of these 4 pixels, based on the decimal digits of the coordinates. The
output is a stream (I, dx, dy) of the interpolated values for the intensity and gradients. However,
as with the previous block, the implementation has changed to allow a lower latency and high
throughput design. This involved transformations of the computation and strategically forced
registers in the tool to allow a better pipelining strategy at the cost of extra resources.
The gradients for a pixel with coordinates (u, v) are calculated as:

dx = (1/2) * (I(u+1, v) − I(u−1, v))
dy = (1/2) * (I(u, v+1) − I(u, v−1))
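A plain-software sketch of this step is given below, under the assumption of a row-major 8-bit image; the helper names are illustrative. Bilinear blending of the intensity and of the central-difference gradients at the four surrounding pixels touches exactly the 4x4 neighbourhood discussed next:

```cpp
#include <cmath>

struct Sample { float I, dx, dy; };

// Pixel read from a W-wide row-major 8-bit image.
inline float px(const unsigned char* img, int W, int u, int v) {
    return static_cast<float>(img[v * W + u]);
}
inline float grad_x(const unsigned char* img, int W, int u, int v) {
    return 0.5f * (px(img, W, u + 1, v) - px(img, W, u - 1, v));
}
inline float grad_y(const unsigned char* img, int W, int u, int v) {
    return 0.5f * (px(img, W, u, v + 1) - px(img, W, u, v - 1));
}

// Bilinear interpolation of intensity and gradients at (xp, yp); the decimal
// digits of the coordinates become the blend weights.
Sample interpolate(const unsigned char* img, int W, float xp, float yp) {
    int u = static_cast<int>(std::floor(xp));
    int v = static_cast<int>(std::floor(yp));
    float a = xp - u, b = yp - v;
    float w00 = (1 - a) * (1 - b), w10 = a * (1 - b);
    float w01 = (1 - a) * b,       w11 = a * b;
    Sample s;
    s.I  = w00 * px(img, W, u, v)     + w10 * px(img, W, u + 1, v)
         + w01 * px(img, W, u, v + 1) + w11 * px(img, W, u + 1, v + 1);
    s.dx = w00 * grad_x(img, W, u, v)     + w10 * grad_x(img, W, u + 1, v)
         + w01 * grad_x(img, W, u, v + 1) + w11 * grad_x(img, W, u + 1, v + 1);
    s.dy = w00 * grad_y(img, W, u, v)     + w10 * grad_y(img, W, u + 1, v)
         + w01 * grad_y(img, W, u, v + 1) + w11 * grad_y(img, W, u + 1, v + 1);
    return s;
}
```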
To calculate these gradients, this block needs to fetch a 4x4 window of pixel values for every
iteration, with an address and offset generated on the fly. The memory partitioning and window
access pattern, which will be explained in more detail in Subsection 4.2.3, are designed to allow low
latency access for these 16 values, with the option of scaling to a single-cycle access configuration
at the cost of more memory ports and blocks. Moreover, this block also generates a stream of
boolean values, with each value paired to a tuple of outputs (I, dx, dy), indicating if the image
coordinates are within the frame boundaries and therefore valid. These boolean values are used
further along the pipeline to indicate a skip for a point along the weight and L.S. calculation
path. Here lies another key difference owing to the dataflow paradigm, which in this application
significantly improved the architecture’s performance and optimisation opportunities. If the
coordinates are invalid the pipeline keeps running, with a pre-set safe memory address and the
output is safely ignored in the next block.
Finally, this unit utilizes fixed-point computation to store the subsampled image pixels and
for the intermediate interpolation steps after that. At the end of the calculations a converter
stores the final results in floating-point registers for the next unit that uses them. This significantly
reduces the cache sizes, as well as the register use of this unit. However, the larger latency
of the fixed-point arithmetic units compared to the floating-point units, using the tools available,
necessitated some extra resources to maintain the same throughput. As mentioned in Chapter
3, the focus of this work was mainly on generating a novel high-performance architecture, and
the custom-precision opportunities that required further investigation were left as a possible
future exploration to enhance resource efficiency once the architecture was finalised.
Error Calculation
This block receives the output streams from the interpolation process and additional infor-
mation forwarded from the Image Frame Projection unit. The block is responsible for the
calculation of different error metrics, the weights and residuals necessary for the generation of the
linear system to be solved and the estimation of the total photometric error of the current pose
estimate. The input streams include the (x,y,z) coordinates of the reference points, re-projected
into the frame of reference of the new frame, the interpolated intensity and gradients streamed
from the Interpolation block and finally a stream of the intensity and depth variance of the
Keyframe reference points.
The first set of outputs of this block are the residual and a weight factor for each Keypoint, which
will be used along with the (x,y,z) coordinates and the gradients in the next block. It is also
responsible for calculating the error and squared error accumulated sums from the reprojection
of each Keypoint, utilized later in the algorithm, as well as sums for the pixel intensities, and
point quality characteristics for each Keypoint that is being tracked. All of this data is utilized
at the linear system generation, with parts of it also exported at the output block (not visible
in Fig. 4.2) as quality metrics for the latest pose optimisation step.
Another key characteristic of these hardware units, including the Linear System Generation
unit of the next subsection, is that they deal with a large number of accumulations. To achieve
a sustained high throughput and low latency for these, the presented architecture stores results
as partial sums in a number of distributed registers, to account for the latency of the floating-point
adder pipeline as mentioned in the Image Frame Projection subsection. After all Keypoints
have finished streaming through, a final accumulation step combines the registers to obtain the
final result. This way more multiply/accumulate units can be placed and function at a higher
utilization for an improved throughput.
Linear System Generation
Finally, this block receives and combines the Keypoint data with the results of the previous blocks
in order to generate the linear system later solved in software to provide the next pose estimate.
The processed information includes the (x,y,z) Keypoint coordinates, intensity and gradients,
as well as the error and weights that have been calculated. The solution step, as well as the
comparison of the error metrics and the update, comprise orders of magnitude fewer
operations compared to the error and weight calculation and the linear system generation, and
they only happen at the end of the processing. The comparisons and branches take tens
to hundreds of cycles on the mobile CPU, which is on the order of a few microseconds, while
the solution of the linear system for a 6x6 matrix using LDLT decomposition in the Eigen
library takes less than 0.25 milliseconds. As such, the resource cost of adding these operations
in hardware is not justifiable; instead, these operations and some high-level control are left
to the software running on the CPU, accounting for less than 2% of the runtime of the tracking
algorithm.
The functionality of this block is to generate the 6x6 matrix A = J^T W J and the vector
b = -J^T W r, with W the diagonal matrix of per-point weights, essentially the linear system to be solved as part of the optimization method described
in the Background. As matrix A is symmetric, only 21 out of 36 elements need to be computed.
Taking this into account, the intermediate results were stored in a partitioned memory of size
21. Depending on the target processing rate, copies of each element register are created and the
accumulations are interleaved so partial sums are stored in each of them. This memory shares
the same distributed register architecture as the one mentioned in the Error Calculation
block, to allow single-cycle throughput for the accumulation operations. At the end of the
computation, these are added and repacked into matrix form to be written back to DRAM.
This unit is characterised by a large number of multiply and add operations per map element.
As a result, a large percentage of the total chip resources was allocated to this unit to satisfy
a requirement for a larger number of computational units to achieve a high point throughput.
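The symmetric accumulation and the final repacking can be modelled in software as below. The flat array of 21 partial results mirrors the partitioned memory described above (a single copy is shown; the hardware interleaves several such copies), while the names and types are illustrative:

```cpp
#include <array>

using Vec6 = std::array<float, 6>;

// Only the 21 upper-triangular elements of the symmetric 6x6 matrix A are
// kept in a flat array (the "partitioned memory of size 21"); the full
// matrix is repacked at the end before write-back.
struct LinearSystem {
    std::array<float, 21> A_upper{};
    Vec6 b{};

    void accumulate(const Vec6& J, float residual, float w) {
        int k = 0;
        for (int i = 0; i < 6; ++i)
            for (int j = i; j < 6; ++j)
                A_upper[k++] += w * J[i] * J[j];  // A += w * J * J^T (upper half)
        for (int i = 0; i < 6; ++i)
            b[i] -= w * J[i] * residual;          // b -= w * J * r
    }

    // Repack into full matrix form, mirroring the final write-back step.
    std::array<std::array<float, 6>, 6> fullA() const {
        std::array<std::array<float, 6>, 6> A{};
        int k = 0;
        for (int i = 0; i < 6; ++i)
            for (int j = i; j < 6; ++j) {
                A[i][j] = A_upper[k];
                A[j][i] = A_upper[k];
                ++k;
            }
        return A;
    }
};
```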
4.2.3 Frame Cache Partitioning
The Sub-pixel Interpolation block requires access to a 4x4 window for each tracked valid map
point. An architecture is presented here for the frame cache to achieve low latency access to
windows of this size (4x4). This is then generalized for a window of any NxM size with M being
a multiple of 4 and N any integer. This mapping is optimised for an FPGA due to the type
and structure of the memory blocks available on this type of device. In most modern FPGAs,
memory is mapped to blocks of SRAM of thousands of bits with each block offering independent
read/write ports. In the devices we target here, 18/36Kbit Block RAMs are available, with
each block offering two independent ports that can operate synchronously. Thus, by organising
these blocks in different configurations one can access multiple pairs of elements simultaneously
provided they are guaranteed to exist in different Block RAMs.
Using the assumption that the image width is a multiple of 4, an assumption that holds in most
widely used resolutions in the literature, the image elements (pixels) are stored in memories
partitioned with a factor of 4 in a cyclic fashion. Thus, the memories are organised as 4 sets of
Block RAMs with each successive pixel cycling between the available sets. In this configuration,
pixels 0,4,8 ... will belong to the first memory set, pixels 1,5,9 ... to the second and so on. Using
the 2-port BRAMs available in this family of FPGA devices, eight successive elements can be
accessed in a single cycle from four sets of BRAM organised in this way. Then, successive
image rows can be stored again in different sets of memories, repeating this cyclic pattern in
the vertical dimension in order to have the ability to perform parallel accesses for successive
rows. Combining the two, neighbouring pixels in a window can be accessed simultaneously by
using different memory sets in parallel.
This partitioning is demonstrated in Fig. 4.3. There are two levels of partitioning, the first one
dealing with storing successive image rows and the second with storing successive pixels in the
same row. A memory partitioned in this fashion for N row-sets and M column-sets results in
M × N sets of memory with 2 × M × N independent ports for read and write access.
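The resulting bank mapping, for M = 4 column-sets and N = 2 row-sets, can be checked with a short model (function names are illustrative): every 4x4 window hits each of the 8 banks exactly twice, which a dual-port BRAM can serve in a single cycle.

```cpp
#include <array>

// Bank index of pixel (u, v) under cyclic partitioning with 4 column-sets
// and 2 row-sets: 8 banks in total.
int bank(int u, int v) { return (u % 4) + 4 * (v % 2); }

// Number of pixels of a 4x4 window anchored at (u0, v0) landing in each bank.
std::array<int, 8> window_bank_counts(int u0, int v0) {
    std::array<int, 8> count{};
    for (int dv = 0; dv < 4; ++dv)
        for (int du = 0; du < 4; ++du)
            ++count[bank(u0 + du, v0 + dv)];
    return count;  // two accesses per bank: one per BRAM port
}
```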
Figure 4.3: Frame Cache Architecture
For the case of a 4x4 window, with N=2 and M=4 the address generation is performed as
follows. For each row, a base address is derived from the address of the first pixel on the left,
defined as the pixel index ∈ [0, (width × height)/2 − 1], by setting the two least significant
bits (LSBs) to 0. This address, and the address base + 1 which is calculated by simply setting
the least significant bit to 1, are sent to the memory banks in the current row as in Fig. 4.4,
while the two LSBs, defined as the offset, are used for the symmetric 4-to-1 multiplexing to
select the required values. Since we are operating in this case with a cyclic partitioning over
four memories, as we can see in the figure, this simplifies multiplexing and address generation
significantly compared to a non-multiple of 4 pixels per row. This partitioning also offers
improved write performance in light of the off-chip read bandwidth. The ports to the off-chip
DRAM read sets of 8 successive 8-bit pixel values as a 64-bit word per cycle, so the ability to
store 8 pixel values per cycle offers the best utilization of that bandwidth to lower the latency
of a frame-cache update.
Lastly, this assumption of a multiple-of-4 pixels per row allows calculating the address of
successive rows of the window by replicating this hardware and memory partitioning, and generating
the address for following rows by adding multiples of the row width as necessary. While any
window height could be supported using a modulo operation to cycle between multiple memo-
ries depending on the row index, a power of two is recommended as this again simplifies the
multiplexing to a simple case of using the least significant bits, since the modulo operation
in hardware can be quite expensive in terms of resources. In the 4x4 case for example, two
memory sets were implemented, each half the size of the camera frame, to store the odd and
even rows respectively. Therefore, depending on the row index of the top-left corner of the
window, 4 base addresses are generated and sent to the sets of memories, to produce 4 sets of 4
pixels with a latency of two cycles to match the performance of the rest of the Direct Tracking
Core pipeline. This latency can be reduced or increased as necessary, trading off memory
ports and the resource cost of extra multiplexers.
Figure 4.4: Frame Cache Architecture
Moreover, this scheme has the added benefit that, when expressed in C for high-level synthesis,
it allows the tools to infer this independence and generate the correct hardware and scheduling.
In this author’s experience, the tools do not guarantee correct operation in such cases and need
a specific manner of expression to produce the expected result. For example, instead of using
base directly with a bit mask, one needs to express base as the integer division (index/4), which
will still result in a bitwise operation after optimisation but will first allow the underlying
compiler to infer the independence between the read operations and therefore the correct schedule.
This takes advantage of the fact that, since offset < 4, the 4 sequential elements of each row are
guaranteed to be in the range [(index/4) × 4, (index/4) × 4 + 6]. Thus, seven values are loaded in
local registers instead of four, starting from the address (index/4) × 4, which will always be aligned
on a boundary of four, and therefore always lie in the first memory set. Then by using the
value of offset, the correct four elements are selected and stored in four registers to be used in
the next part of the pipeline. In the case of FPGAs and high-level Synthesis tools, this design
utilizes extra memory ports (this has a lower cost in FPGAs compared to ASICs due to the
mapping of that memory to multiple 2-port BRAM blocks by default) to achieve a low-latency
access for the desired window.
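The base/offset read described above can be sketched as a software model: four cyclically partitioned banks are each read at base and base + 1 (one dual-port BRAM read pair per bank), and the offset then selects the four requested consecutive pixels. Variable and function names are illustrative:

```cpp
#include <array>
#include <vector>

// Reads four consecutive pixels data[index .. index+3] from four banks that
// store the image cyclically (bank b holds pixels b, b+4, b+8, ...).
std::array<unsigned char, 4> read4(
        const std::vector<std::vector<unsigned char>>& bankmem, int index) {
    int base = index / 4;    // expressed as a division so the HLS compiler
    int offset = index % 4;  // can infer independence between bank reads
    unsigned char line[8];
    for (int b = 0; b < 4; ++b) {       // two reads per bank: base, base + 1
        line[b]     = bankmem[b][base];
        line[b + 4] = bankmem[b][base + 1];
    }
    std::array<unsigned char, 4> out{};
    for (int i = 0; i < 4; ++i)         // offset-based 4-to-1 style selection
        out[i] = line[offset + i];
    return out;
}
```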
4.3 Evaluation
In this chapter the experiments for evaluating the performance of the architecture were con-
ducted on the same board utilized in Chapter 3, carrying a Zynq-7020 FPGA-SoC [76]. Later,
the scalability of the architecture was investigated by targeting a Xilinx ZC706 Evaluation
board that offered significantly more resources to explore designs across a bigger spectrum of
resource constraints. The board setup for both was similar to the one already described in
detail in the previous chapter, with most of the differences focusing on altering the software to
allow a more efficient operation and some improvements to the drivers handling communication.
LSD-SLAM was compiled and tested as software, and then the core functionality of the tracking
task was replaced with a function that utilizes the accelerator presented in this Chapter. The
communication is achieved through user-space drivers that were designed as part of this work,
using a direct slave interface from FPGA to CPU set up through the Xilinx Vivado toolchain.
The design presented was implemented and synthesized using Vivado HLS and Vivado (v.2015.4).
It should be noted at this point that during the design of the architecture presented in this
Chapter and the following one it was not straightforward to map the intention of our designs to
what the tool allowed. A lot of scheduling and performance issues were overcome with careful
placement of resources and manual isolation of false dependencies, with some even requiring
forcing the tool to disable standard optimisations. The conclusion from this is that while they
allow faster design-space exploration, these tools lack maturity, contain many bugs and are not
yet ready for a complex production-level design.
In the scalability evaluation a newer version of the tools (v. 2016.1) was used to target the
ZC706 Evaluation board. The reprogrammable logic on that device contains 54,650 logic slices,
with 218,600 LUTs, 545 BRAMs for a total of 19,620 Kbit, and 900 DSP units [2], allowing for
deeper, more complex designs with higher throughput and larger caches. Due to the resource
constraints of the FPGA fabric on the smaller Zynq-7020, a design point with a throughput of
one Keypoint per 3 cycles was selected for the experiments from the design space we have examined.
On that board, the post-implementation resource usage was 80.8% of the LUTs (43,028) and
80% of the DSPs (176). It also required 78.93% of the BRAMs (110.5) and 54.84% of the FFs (58,352).
A frequency of 111 MHz was achieved on this board. The estimated dynamic power consumption
of the SoC for this configuration is 2.7 W for the CPU and FPGA combination.
4.3.1 Custom Core Performance
At this design point and frequency the acceleration achieved by the core was 40× in comparison
to a hand-optimized software version running on the dual core ARM-Cortex A9 of the same
board, as shown in Fig. 4.5. This corresponds to a processing time of 9.49 ms in comparison
to 383.9 ms for the software version. When accounting for the cost of copying the data to
be processed to a dedicated area of the DRAM and the setup cost, the total latency for all
iterations increases to 18.34 ms shown in Fig. 4.6, with the acceleration coming to 21×. This
penalty is due to the fact that in the evaluated version, the Keyframe and Keypoints reside
in software/OS managed virtual memory, as the mapping task is still executed as software.
Therefore, for every map update, the new values of the Keypoints have to be copied to the
FPGA managed memory for the next camera frame to be tracked.
In the published article relating to this chapter, the reported latency is the 18.34 ms figure with
the memory copy cost of synchronising the Keypoints included. However, when evaluating
the accelerator as custom hardware designed to work as part of a domain-specific system, the
Figure 4.5: Processing Time Per Level (hardware vs. software per pyramid level, log scale, microseconds)
important figure is the processing time at less than 10 milliseconds. For a hardware track /
map pipeline, where the estimated map generated by the mapping task does not need constant
synchronisation with software threads as the Keyframe and Keypoints can permanently reside
in FPGA managed memory, this data copying ceases to be necessary. As such, this design
working alongside a mapping accelerator on a common memory space could achieve frame-
rates of up to 100 frames per second at this resource cost. At this performance level, Fig. 4.7
demonstrates the achieved acceleration over the low-power ARM Cortex-A9 and the accelerator
of Chapter 3.
4.3.2 Resource Scaling
Fig. 4.8 shows post-implementation resource usage for different performance points, ranging
from 1 Keypoint per 4 cycles to 1 Keypoint every 2 cycles. The scaling results are provided for
the same design, with Vivado now targeting the larger Zynq ZC706 board. The most significant
factor in obtaining increased performance is adding floating-point arithmetic units, which mostly
affects DSP usage, since the bottleneck currently is not memory bandwidth. This is due to
the wide and fast-operating input stages. However, if one wanted to continue the scaling trend
additional or faster-operating memory ports would become necessary, as the current bandwidth
becomes exhausted. For example, the input currently provides 2 Keypoints per 3 cycles in the
current configuration of a 64-bit input unit dedicated to the (x, y, z) input vector of 32-bit
values, so a pipeline that could process 1 Keypoint per cycle would be approximately 33%
underutilized.

Figure 4.6: Total time per pyramid level including memory copy, hardware plus memory copy vs. software, log scale in microseconds.
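The 33% figure follows from the width mismatch: a 64-bit port delivers two 32-bit values per cycle, while each Keypoint needs the three values of its (x, y, z) vector. A small sketch of the rate calculation (the function name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Keypoints delivered per cycle by a bus of `bus_bits`, when each Keypoint
// consists of `values_per_point` values of `value_bits` each.
double keypoints_per_cycle(int bus_bits, int value_bits, int values_per_point) {
    return static_cast<double>(bus_bits) / (value_bits * values_per_point);
}

// One 64-bit port feeding 3 x 32-bit Keypoints: 2 Keypoints per 3 cycles,
// so a 1-Keypoint-per-cycle pipeline would sit idle a third of the time.
```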
The DSP block usage scales almost linearly with the required throughput, with some addi-
tional LUTs and FFs required for the arithmetic units and to provide more pipeline registers.
The reason the apparent gradient of LUT utilisation is lower compared to DSP utilisation
is that LUT usage is split between computation units and registers, whereas DSPs are used
almost exclusively for floating-point arithmetic. The design implemented on the Zedboard is the
one corresponding to a throughput of one Keypoint per 3 hardware cycles, because of DSP
availability.
Although the principles we have discussed remain the same, design points beyond 0.5 Keypoints
per cycle were not synthesised as part of this work, for two reasons. Most importantly, as has been
mentioned before, the next available design point would require a duplication of the majority
of resources to reach the step of 1 Keypoint tracked per cycle, but at that point the bottleneck
would shift to the memory system, which can currently fetch 2 Keypoints every three cycles.

Figure 4.7: Performance comparison of this accelerator with the previous work presented in Chapter 3 and with the NEON-accelerated software on the ARM Cortex-A9 as baseline (approximately 2.5, 4.3 and 22.5 frames per second respectively).
That duplication, with a widening of the memory design to enable higher bandwidth, would
result in a design requiring the majority of the resources available on the reconfigurable
fabric of off-the-shelf FPGA-SoC devices. That would push resource usage to a point where a
second accelerator for the mapping task of SLAM would not fit on the same FPGA fabric for
off-the-shelf FPGA-SoCs. Moreover, when this design functions alongside a mapping accelerator as
planned, the 0.5 Keypoints per cycle performance point would already extend the performance
of this accelerator to a level multiple times higher than the one used by the authors
of LSD-SLAM in practice [1], and at that point it should be able to cover quite a large array of
applications, as discussed in Chapter 2.
4.3.3 Running as part of a SLAM Pipeline
The two ARM cores on the Zynq are running a distribution of Ubuntu Linux as the operating
system. This provides the necessary libraries for the open source version of LSD-SLAM to
compile and execute. That version was subsequently modified so that the tracking function,
with the exception of some control, the solving of the Levenberg-Marquardt linear system and
the copying of some data, is completely taken over by the FPGA core. The DRAM area used
by the FPGA is marked as an I/O peripheral, making it uncached and unbuffered. This allows
simpler sharing of data between the two cores and the FPGA, though it can reduce the efficiency
of writes and reads to this memory from the Linux-based OS.

Figure 4.8: Performance/resource scaling targeting a larger FPGA: post-implementation LUT, LUTRAM, FF, BRAM and DSP usage (%) on the ZC706 at 0.25, 0.33 and 0.5 Keypoints per cycle.
However, the effect is less significant in our case, since mostly streams of sequential stores are
performed, which rely more on the throughput of the memory system than on its
access latency. A more significant factor than the use of unbuffered memory on
the mobile CPU’s side is that the mapping task still runs in software. That means that for every
map update the updated map has to be copied to the dedicated DRAM area of the accelerator,
as mentioned in Section 4.2. In fairness to related work, we report the performance taking this
copy step into consideration, but this is not indicative of the actual peak performance of the
design, and it can be avoided once the map data is fully migrated into the dedicated DRAM area,
with a custom hardware accelerator for that task as well.
Running a full SLAM system, the performance achieved for the entire tracking function was
on average 22.7 frames/s. This was tested on a sequence of camera frames from the “Desk
Sequence” provided by the authors of LSD-SLAM [17] stored on flash memory. Before a new
frame can be processed the FPGA needs to wait for the software system to generate the required
input data from the map. This is a source of delay caused by the interfacing with the second half
of the algorithm, as will be discussed in the next section. The software running on the dual-core
ARM Cortex-A9 in comparison achieved 2.25 frames/s on average, an order of magnitude slower,
using both cores to process in parallel at a frequency of 667 MHz. The achieved framerate, when
accounting for the software overhead in the test system, corresponds to a processing time of
44 ms. This brings the acceleration for the full system from 21× down to 10× in comparison
to the software version and, finally, to approximately 5× in comparison to the accelerator
presented in Chapter 3.
4.4 Performance Analysis
The two input ports of the core are connected through an AXI4 interface to two of the “high
performance” (HP) ports of the Zynq, where the accelerator acts as Master, and can sustain
a high throughput for reads and writes. In addition, an AXI4-Lite interface to one of the
“General Purpose” ports where the core acts as a slave is used to receive control and parameter
signals from the software. Data is moved using the maximum bus width, in this case 64 bits,
and unpacked in the hardware into two 32-bit floating-point values.
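The unpacking of a 64-bit bus word into two 32-bit floats can be mirrored on the host side as follows. This is a sketch with illustrative function names; in hardware the same operation is plain bit slicing:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

// Unpack one 64-bit bus word into two 32-bit floats (host-side mirror of the
// hardware unpack stage).
std::pair<float, float> unpack_word(uint64_t word) {
    float lo, hi;
    uint32_t lo_bits = static_cast<uint32_t>(word & 0xFFFFFFFFu);
    uint32_t hi_bits = static_cast<uint32_t>(word >> 32);
    std::memcpy(&lo, &lo_bits, sizeof lo);  // reinterpret bits as float
    std::memcpy(&hi, &hi_bits, sizeof hi);
    return {lo, hi};
}

// Pack is the inverse, used here only for a round-trip check.
uint64_t pack_word(float lo, float hi) {
    uint32_t lo_bits, hi_bits;
    std::memcpy(&lo_bits, &lo, sizeof lo);
    std::memcpy(&hi_bits, &hi, sizeof hi);
    return (static_cast<uint64_t>(hi_bits) << 32) | lo_bits;
}
```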
The HP ports can scale up to a theoretical 1200 MB/sec of read bandwidth each at the maximum
frequency of 150 MHz, and with a continuous stream of 64-bit requests. In a test with a simple
producer/consumer hardware pair, a sustained read and write throughput up to 1000 MB/sec
each way has been tested successfully for a single HP port. However, there are design parameters
that end up reducing the maximum achievable bandwidth, including protocol overheads, the
frequency of the master connected to this port and the overheads of the memory management
unit (off-FPGA) and the off-chip memory chip itself.
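The 1200 MB/sec figure is simply the product of the bus width and the clock frequency; a one-line check (the helper name is illustrative):

```cpp
#include <cassert>

// Theoretical peak bandwidth of one AXI HP port in MB/s, ignoring the
// protocol overheads that, as noted above, reduce the achievable figure.
double peak_mb_per_s(double freq_mhz, int bus_bits) {
    return freq_mhz * (bus_bits / 8.0);  // MHz times bytes-per-beat = MB/s
}

// peak_mb_per_s(150, 64) gives 1200 MB/s per port.
```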
Firstly, the ports have to use the full bus width of 64 bits to achieve their full bandwidth, while
operating at their maximum frequency of 150 MHz. However, the frequency is dependent on
the rest of the Tracking Core, as the transactions are issued at the same clock as the rest of
the pipeline; as a consequence of the HLS tools used, the entire generated pipeline had to be
configured to utilize a single clock input. We need to process a set of 5 floating-point values for
each iteration of the Tracking Core. When transferring them using multiple 64-bit ports, a
percentage of the total bandwidth will be wasted, because of the mismatch between the memory
transfers per cycle and the corresponding consumption rate of the hardware.
Looking at the case of transferring 5 values per iteration through three hardware ports, one
can bring in 6 values per cycle. That means that the core will be able to process exactly one
5-vector per cycle if computation can keep up, wasting 50% of the bandwidth of the third port.
As a result of this mismatch there will always be lower than 100% utilization, either on the
side of computation or on the side of communication. However, if the tools allowed a design where
the input blocks could run at a different clock than the rest of the pipeline, these blocks could
be more closely matched, allowing for a more efficient interface design for some design points.
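The 50% figure for the third port generalises: with p 64-bit ports each delivering two 32-bit values per cycle, and n values consumed per iteration, the fraction of the delivered bandwidth actually used is n/(2p). A sketch with an illustrative helper name:

```cpp
#include <cassert>
#include <cmath>

// Fraction of delivered input bandwidth consumed by the pipeline, given
// `ports` 64-bit ports (two 32-bit values per cycle each) and
// `values_needed` 32-bit values consumed per pipeline iteration.
double bandwidth_utilisation(int ports, int values_needed) {
    return static_cast<double>(values_needed) / (2.0 * ports);
}

// Three ports deliver 6 values per cycle against the 5 needed: 5/6 used
// overall, i.e. the third port supplies only 1 of its 2 values per cycle.
```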
Finally, a main bottleneck, which becomes much more apparent on smaller boards, is the available
computational units. A smaller FPGA will offer fewer LUTs and DSPs to use and can only fit
design points with a lower throughput. However, by running the inputs at a slightly faster
rate than the Tracking Core, the pipeline utilization will get closer to 100%, leading to more
efficient use of the available compute performance, as the memory system’s latency is hidden.
4.5 Conclusions
The work presented in this chapter demonstrates a high-performance, state-of-the-art tracking
architecture with embedded-grade power consumption, by utilizing an SoC with reconfigurable
logic. This work proved the effectiveness of the dataflow paradigm for tracking in semi-dense
SLAM, despite the non-regularity of some computations. It demonstrated the effectiveness of
separating communication from computation, to manage the existing memory bandwidth in the
most efficient way, and utilized multi-port, specially partitioned caches to deal with the latency
of accessing a random window from an image frame.
The proposed architecture is utilized in a full state-of-the-art semi-dense SLAM implementation,
with an estimated 2.7 W of total power. The FPGA performed direct semi-dense tracking,
acting as a co-processor to a dual-core ARM Cortex-A9, with the full system achieving more
than 22 frames/s tracking performance. The architecture is scalable, so it can target FPGAs
with different resource profiles and is designed for significantly faster operation, thus paving
the way to a complete embedded SLAM solution.
In the next chapter, an accelerator is proposed to close the loop of SLAM and fully address the
requirements of the other half of real-time SLAM: the map generation.
Chapter 5
Accelerating Mapping for SLAM
The previous chapter presented a high-performance, low-power design to
accelerate the task of tracking in semi-dense SLAM. This enabled, for
the first time, a state-of-the-art semi-dense SLAM algorithm to
achieve real-time performance on an embedded low-power platform, without
any compromise in the quality of the result. That work paved the way
to a complete embedded SLAM solution, but since real-time operation is
composed of two strongly interdependent tasks, tracking and mapping, the
second part has to be addressed as well. These two tasks have to happen in
real-time and comprise the majority of the frame-to-frame computational
requirements of the application. Thus, the acceleration of the mapping task
was selected as the next research direction, and this chapter presents this
author’s original work towards a complete hardware architecture
targeting embedded semi-dense SLAM.
This chapter is based on a conference paper co-authored by Christos-Savvas
Bouganis [‘A Scalable FPGA-based Architecture for Depth Estimation in
SLAM’, Boikos Konstantinos et al., ARC 2019]
5.1 Motivation
SLAM is composed of two strongly interdependent tasks: tracking and mapping.
In an unknown environment an accurate output from the first is needed to start generating
the second and vice-versa. So far, research and design of custom hardware for tracking has
been presented. This was the first component to address since it is the one most sensitive
to large latencies in processing and dropped frames. This work targets high performance, to
address the latency requirements of the application, and low-power consumption, to enable
the embedding of this type of application in low-power, lightweight robotic platforms such
as quadcopters and household robotics. Having designed FPGA-based accelerators to enable
real-time performance for semi-dense tracking with low-power embedded hardware, the next
challenge towards enabling state-of-the-art semi-dense SLAM on such platforms is
accelerating the task of mapping.
In Chapter 3, the two tasks of tracking and mapping were confirmed to account for almost
80% of the computation cost of the algorithm. As well as demonstrating high computational
requirements, these two tasks have the same latency specifications, at, or close to, the camera
frame rate. This, together with their interdependency, means that to effectively accelerate
SLAM both have to be performed with a low latency. Everything else can be offloaded, postponed
or computed in the background without significantly impacting the robustness of SLAM,
and only affects the large-scale / long-term consistency of the solution. Moreover, most of the
inputs of these two tasks are either the output of the other, or shared. This makes the most
efficient choice the one where they inhabit the same chip, where time- and energy-costly memory
transactions and other communication are reduced.
In this context, this chapter presents a novel architecture with a combination of dataflow
processing and local on-chip caching to match the unique demands of a semi-dense mapping al-
gorithm. It is a custom FPGA-based accelerator architecture for semi-dense mapping, targeting
advanced, state-of-the-art semi-dense SLAM algorithms such as [17]. It follows and improves
upon design ideas from previous chapters, and introduces novel contributions, combining dy-
namic iteration pipelines and traditional streaming elements to achieve high performance and
power efficiency for the task of mapping.
The key contributions presented in this chapter are the following:
• The design of a scalable, high-performance and power-efficient specialised accelerator
architecture that can process and update a map in less than 20 ms, the average latency
between two camera frames.
• A system-on-chip that combined with the work presented in Chapter 4 forms the first, to
the best of this author’s knowledge, complete SLAM accelerator on FPGAs, pushing the
state of the art in performance and quality for SLAM on low-power embedded devices.
5.2 Mapping Algorithm - LSD-SLAM
The target of the accelerator presented here is the mapping algorithm of LSD-SLAM [17],
following the principles described above in Section 2.6. This configuration has been proven
to provide the state-of-the-art performance discussed in Chapter 2 and maintains the best
compatibility with this author’s existing work. The tracking accelerator presented in Chapter
4 targets LSD-SLAM, and as such, implementing this mapping algorithm as the base for this
architecture means the two architectures can cooperate with few modifications to previous
work, and the resulting system can be utilized directly as a drop-in replacement for the software
functions completing the same task.
After a successful execution of tracking, the relative 6-Degree-of-Freedom position and orien-
tation between the latest camera frame and the world is estimated. It is then the aim of the
mapping algorithm to use that information to triangulate points from two views; the current
camera frame and a previous frame in the camera’s trajectory stored with its world-to-camera
pose along with depth information in the Keyframe data structure.
All visible points with a sufficient gradient successfully matched from Keyframe to camera
frame will have a depth value stored in this data structure. Using this information, the mapping
algorithm adds a new observation for the part of the environment that is observed for the first
time, and performs a filtering update to improve the observations of points also seen in the past.
At the end of this process, successfully observed points in space will have an estimated depth
and depth variance value stored in the Keyframe. The algorithm also keeps track of metadata
relating to how well this point has been tracked from different points in the camera’s trajectory
after being created. This information is used at different parts of the algorithm to add heuristic
optimisation which improves quality of depth estimation and the overall robustness.
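The filtering update mentioned above can be sketched as a one-dimensional fusion of the prior inverse-depth estimate with the new observation, each weighted by the other's variance. This is the standard product of two Gaussian estimates, intended as an illustration of the kind of update performed, not a verbatim copy of the LSD-SLAM code:

```cpp
#include <cassert>
#include <cmath>

// Inverse depth estimate with its variance.
struct InvDepth { double d; double var; };

// Fuse a prior estimate with a new observation: each mean is weighted by
// the other's variance, and the fused variance shrinks accordingly.
InvDepth fuse(InvDepth prior, InvDepth obs) {
    double w = prior.var + obs.var;
    return { (obs.var * prior.d + prior.var * obs.d) / w,
             (prior.var * obs.var) / w };
}
```

Fusing two agreeing observations of equal variance halves the variance, which is why repeated successful matches tighten the depth estimate over the camera's trajectory.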
As discussed in Chapter 3, the tasks involved in SLAM were profiled, running as software on
a high-end desktop CPU. The results showed that the mapping task was one of the heaviest
tasks happening during LSD-SLAM, at 38% of computation time across all cores. Further testing
on the ARM Cortex-A9 of the FPGA board verified that the profiling results held true, with
timing tests measuring the mapping task at an average of 530 ms per map update.
Algorithm 7 is a simplified high-level view of the tasks and functionality of the depth estima-
tion of LSD-SLAM that are implemented in the coprocessor. The pseudocode presented here
essentially summarises the tasks discussed in this Chapter and in Section 2, and introduces the
functionality in the order that it is implemented in hardware and discussed in Section 5.3. The
first step is updating the frame cache and Keyframe cache in the accelerator. These caches are
of the same size as the frame’s resolution, and store an up-to-date copy of the pixel values for
both the current camera frame and the Keyframe.
Then the three functions of generating new matches and observations, regularising by filling
gaps and calculating a smoothed regularised value are essentially executed in parallel as their
parts are implemented in the different hardware units demonstrated in Fig. 5.2 and discussed in
the following section. Their functionality is summarised in pseudocode statements that can be
used as a reference while reading through Section 5.3. Comments are used to indicate
which software functions the statements correspond to, but all the statements are executed in
sequence, in a dataflow fashion, for all of the Keypoints.
At this point we copy the variable definitions stated first in Chapter 2 for ease of reference.
– Pose: Camera to world pose estimate for the Keyframe’s camera frame.
– Frame: A camera frame that was selected as a Keyframe candidate and its associated
metadata.
– Keypoint Array: The array has the same dimensions as Frame (width × height). Each
Keypoint is a potential depth observation, and is composed of the following variables:
– Valid bit (Is there a valid observation for this pixel)
– Blacklisted (Has it been blacklisted after multiple failures to match)
– Validity Score (Used to keep track of successful or failed attempts to match)
– Inverse Depth (Depth estimate after a map update)
– Inverse Depth Variance (Estimated variance for above depth)
– Smoothed Inverse Depth (Smoothed version of inverse depth)
– Smoothed Inverse Depth Variance (Smoothed version of variance estimate)
– Max gradient array: Same dimensions as Frame. The maximum gradient in a region
around every pixel. A precise definition of its calculation is included in Chapter 3.
5.3 Architecture of the Mapping Accelerator
5.3.1 Coprocessor architecture and FPGA-SoCs
The architecture targets an FPGA-SoC that contains an FPGA fabric and a mobile CPU.
These platforms have been discussed in previous chapters, and the principles here on a system
level are similar to the ones discussed in Chapter 4, with the exception that in this work the
output bandwidth is more important and was designed to specs similar to the input bandwidth.
The CPU and FPGA function independently and can operate on the same memory space and
both have direct access to a common physical DRAM. There are master memory controllers
on the custom hardware for Direct Memory Access (DMA), designed to operate at full-speed
bursts for updating the caches before operation, or to provide a constant stream of map points
for the execution of the algorithm.

Algorithm 7 Update Map and Regularize
procedure Update Map and Regularize(Keyframe, tracked Frame, pose estimate, K)
    # Update and filter functions were fused into one streaming accelerator
    Update Tracked Frame Cache            # Copy latest tracked camera frame on-chip
    if New Keyframe was generated then
        Update Keyframe Cache             # Gets triggered after create Keyframe executes
    # Function Observe Depth
    for each Keypoint k do
        Calculate max gradient in neighbourhood
        if max gradient > gradient threshold then
            Calculate epipolar line
            if line and parameters are valid then
                Start exhaustive search
                if Successful Observation then
                    Create observation or update depth value
        # Function Regularize Fill Gaps
        if k.valid == 0 then              # No valid observation yet at pixel
            # Sliding window buffering
            Calculate average confidence in window of valid Keypoints
            if neighbourhood confidence > confidence threshold then
                Calculate weighted depth average    # Weigh by Validity Score
                Initialize new Keypoint k           # Gets weighted average values here
                k.depth = depth average
                k.valid = 1
        # Function Calculate Regularized Depth
        # Delayed sliding window, buffering results of Regularize Fill Gaps
        if k.valid == 1 then
            Calculate smoothed depth and variance value
            k.smoothed inverse depth = smoothed depth
            k.smoothed inverse depth variance = smoothed depth variance

In addition to the high-speed memory connections, there
is a direct slave-to-master connection to the CPU, where the CPU acts as a master. In this
manner, the CPU has the high-level control of the coprocessor on the FPGA, and can change
its operating parameters and coordinate its operation with the software back end. Fig. 5.1, an
annotated figure from Chapter 2, demonstrates this system architecture.
In general, the co-processor architecture is designed to perform semi-dense mapping as part of
LSD-SLAM, with an existing Keyframe with depth and metadata for every observed pixel as
input and a filtered and updated version of that Keyframe with new and updated depth values
as output. In addition to the base operations of stereo matching and scanning along a variable
baseline using epipolar geometry, the heuristic features of LSD-SLAM have been included in
the hardware (for example, the pixel quality metric, and limiting scans to a confidence interval on
the epipolar line, as discussed in Chapter 2), performed in a streaming dataflow pattern.

Figure 5.1: Zynq 7-series FPGA-SoC and interconnect: the dual-core ARM Cortex-A9 (with L1 I/D caches and a 512 kB L2 cache and controller), DRAM controller and memory interconnect, with the tracking and mapping accelerators in the reconfigurable logic connected through FIFO-buffered 64-bit high-performance AXI (HP[3:0]) ports and 32-bit general-purpose slave ports. Only connections relevant to the architectures researched in this thesis are included for clarity of presentation.
choice was made to keep compatibility with this state-of-the-art method and maintain the same
accuracy and robustness.
Nevertheless, in order to increase the performance that is attainable by the proposed custom
hardware design, the actual hardware implementation is modified with respect to the original
software implementation. For instance, a number of values such as the maximum gradient
in a neighbourhood were more efficiently calculated on the fly than pre-computed as done in
software. Additionally, most of the functions in the algorithm are combined in one streaming
pipeline utilizing buffers to overlap computation, as this avoids redundant memory traffic and
significantly improves performance and power efficiency. In contrast, software traditionally
splits the computation into different functions, looping over the same data a number of times,
once for each function.
5.3.2 High-level Architecture Overview and Functionality
As a first step, the “layout” of the hardware pipeline will be described with its high-level func-
tionality and some of the system-level and interface design choices that were made. Figure 5.2
contains a high level view of this architecture omitting some connections for clarity, such as
constant propagation and some extra connections to off-chip interfaces. The figure begins with
the I/O controller and lines, and aims to show the dataflow path, as well as the information
transferred. The connections between units are labelled up to the rate of one Keypoint per 5
cycles, which shows the nominal processing and production/consumption rate of the presented
architecture at its default configuration. In later subsections, the architecture’s scalability will
also be discussed, changing some of these parameters to target different performance points
with different resource requirements.
As a first step, before performing the depth-update that this accelerator is designed for, the
hardware performs an update of the caches. The write connections are not shown in Fig. 5.2 to
simplify presentation; the paths shown are the ones used for the majority
of the runtime, which are the read buses to two different processing units. The Keyframe pixel
cache does not need to be updated for all update steps, until the transition to a new Keyframe,
at which point the depth values are propagated to it (overlapping pixels are assigned the
depth value from the old Keyframe with an increased uncertainty), and a new camera frame
is used for the Keyframe. In contrast, for every Keyframe update a copy from the off-chip
DDR memory to the frame cache is required to fetch the newest tracked camera frame pixel
information. Other data, such as the pose of that tracked frame, are stored in hardware registers
updated through a userspace driver running as software on the mobile CPU. For the update of
these values and the rest of the control, the CPU uses the slave interface to the FPGA.
Following that, as input, the architecture starts reading all the points of the Keyframe sequentially
from the off-chip memory. As soon as each updated Keyframe row is computed, it starts
streaming its output back to the same memory, at a different location. The functionality of
the first two units is to ensure a fast and consistent stream of Keyframe points. The “Input
Memory Controller” performs full-speed burst reads from the off-chip memory, which are then
buffered and streamed as Keypoints from the “Unpack Unit” to the rest of the pipeline.

Figure 5.2: Block diagram of the accelerator architecture. Burst reads from off-chip memory (64 bits per cycle) feed the Input Memory Controller and Unpack Unit, followed by the Keypoint and Gradient Check, the Epipolar Line and 5-Point Unit, Generate Scan Points, the Cache Request Handler, Subpixel Intensity Calculation and the Loop Processing Unit, then Subpixel Stereo, New Depth Calculation, Depth Integration, the Fill Gaps and Depth Regularize filters, and the Pack and Output Controller, which burst-writes results back to off-chip memory. Fast-rate pipeline sections run at one Keypoint per 3 cycles, with an overall rate of one Keypoint per 5 cycles, and the Keyframe pixel cache and camera frame cache serve the processing units.

The
information making up one point includes a confidence rating and validity indicator, as well as
past predictions for depth and depth variance. These points comprise precisely 192 bits
each, fetched as three 64-bit words and later unpacked to the values they represent, as follows:
• unsigned 8-bit integer Valid → Flag indicating if depth estimate is valid for this point
• 8-bit integer Blacklisted → Indicator of repeated depth estimation fails
• 8-bit integer Validity score → Weighted score indicating rate of successful observations
• 8-bit integer Op-code → Used exclusively by the FPGA; exported so that it can be read
from the software to diagnose issues.
• 32-bit reserved Word → Not currently used; kept as a placeholder to keep the record size
equal to the software implementation, at exactly 192 bits, and to simplify the Input and
Output stages.
• 32-bit floating-point Inverse depth → Inverse depth estimate
• 32-bit floating-point Inverse depth variance → Inverse depth variance estimate
• 32-bit floating-point Smoothed inverse depth→ Smoothed Gaussian for visualisation,
search initialisation and other tasks
• 32-bit floating-point Smoothed inverse depth variance → Variance for smoothed
Gaussian
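A host-side mirror of this record can make the 192-bit budget explicit; the field names follow the list above, while the exact packing shown is an illustrative layout:

```cpp
#include <cassert>
#include <cstdint>

// 192-bit (24-byte) Keypoint record, matching the field list above.
struct Keypoint {
    uint8_t  valid;                        // valid-observation flag
    int8_t   blacklisted;                  // repeated-failure indicator
    int8_t   validity_score;               // weighted success score
    uint8_t  opcode;                       // FPGA-side diagnostic code
    uint32_t reserved;                     // placeholder word
    float    inverse_depth;
    float    inverse_depth_variance;
    float    smoothed_inverse_depth;
    float    smoothed_inverse_depth_variance;
};

static_assert(sizeof(Keypoint) == 24, "record must stay exactly 192 bits");
```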
Internally, a representation of a single bit is propagated along the pipeline for the Valid flag
to reduce resource usage, especially, but not exclusively, in the case of on-chip Block RAMs.
For all the other variables, the same word size and data type as above were used, and a buffered
stream was created for each one. These streams are bundled together as a Keypoint stream,
indicated by the main arrows in Fig. 5.2.
As discussed in Section 2.6, the first step once target points start streaming in is to determine
which ones are valid and are good candidates to attempt to match, with a local gradient above
a certain threshold. This starts by checking their valid flag and a pixel quality score maintained
over successive attempts to map, as well as calculating on the fly the max gradient of each
pixel and its immediate neighbours. The condition is that the maximum magnitude of the
intensity gradient in this group of pixels must be above a threshold to consider matching this
pixel later on. As Keypoints start streaming in, the hardware buffers the first four rows of the
Keyframe, and maintains a 5x5 register array which maps to a sliding window panning over
the Keyframe’s intensity values left to right, in successive rows, as shown in Fig. 5.3.
Maintaining the reading pattern of Fig. 5.3 is important to ensure the most efficient access
pattern for heavily re-used memory regions. By taking advantage of the fact that the points
of the Keyframe are accessed sequentially and in this part of the processing all accesses are
going to be in an immediate neighbourhood of a 2-pixel radius, multiple accesses to the same
value are mapped to registers instead of going to the cache. To enable the implementation of
this register window, four additional row buffers are used, each equal in size to the image width.
The sliding window itself is implemented as an array of 25 registers, with the current target
stored in the centre. A diagram of this configuration is presented in Fig. 5.4.
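A software model of this structure, with hypothetical names, illustrates how one incoming pixel per cycle keeps a full 5x5 neighbourhood available in registers (in hardware, the window is a shift-register array and the row buffers map to on-chip RAM; border handling is omitted for brevity):

```cpp
#include <array>
#include <cassert>

// Four row buffers of image width W plus a 5x5 register window,
// updated once per incoming pixel of the Keyframe pixel stream.
template <int W>
struct SlidingWindow5x5 {
    std::array<std::array<float, W>, 4> rows{};  // four row buffers
    std::array<std::array<float, 5>, 5> win{};   // 5x5 register window

    void push(float pixel, int x) {
        // Shift the window left by one column.
        for (int r = 0; r < 5; ++r)
            for (int c = 0; c < 4; ++c)
                win[r][c] = win[r][c + 1];
        // New rightmost column: the four buffered rows plus the new pixel.
        for (int r = 0; r < 4; ++r) win[r][4] = rows[r][x];
        win[4][4] = pixel;
        // Rotate the row buffers at this column for the next image row.
        for (int r = 0; r < 3; ++r) rows[r][x] = rows[r + 1][x];
        rows[3][x] = pixel;
    }
};
```

After streaming rows 0 to R and columns 0 to C in raster order, the window centre holds pixel (R-2, C-2), so the gradient check always operates two rows and two columns behind the input.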
As this scanning happens, these registers are used by the “Keypoint and Gradient Check” unit,
which is responsible for the on-the-fly calculation of the maximum gradient in a neighbourhood
of the pixel. First the gradients in the two image dimensions, horizontal (denoted with the
variable x here) and vertical (denoted with the variable y), are calculated for the set of five
pixels. These are the target pixel, the ones immediately left and right of it, and the ones directly
above and below. Following that, a gradient magnitude is calculated for these five pixels as the
134 Chapter 5. Accelerating Mapping for SLAM
Figure 5.3: Sliding window over current keyframe
KeyframePixel Stream
Row 0 Buffer
Row 1 Buffer
Row 2 Buffer
Row 3 Buffer
X
Figure 5.4: Sliding window utilizes shift registers and 4 row buffers
5.3. Architecture of the Mapping Accelerator 135
X
Bottom
Top
Left Right
dy
dx
(x, y)
(x, y+1)
(x, y-1)
(x-1, y) (x+1, y)
𝑔𝑟𝑎𝑑(𝑥, 𝑦) = 𝑑𝑥2 + 𝑑𝑦2
𝑑𝑥(𝑥, 𝑦) = 𝐼 𝑥 + 1, 𝑦 − 𝐼 𝑥 − 1, 𝑦
𝑑𝑦(𝑥, 𝑦) = 𝐼 𝑥, 𝑦 + 1 − 𝐼 𝑥, 𝑦 − 1
Figure 5.5: The intensity gradient is calculated in the two image directions, for the target pixeland its four immediate neighbours, resulting in a total of 13 accesses, 20 gradient calculationsand a final reduce operation to find the maximum value in the region
square root of the sum of the squared horizontal and vertical gradients, √(dx² + dy²), as shown in
Fig. 5.5. This represents a total of 20 accesses completely served by the 5x5 register window,
which results in lower latency and improved performance over direct cache access. Finally,
the maximum of these values is calculated and compared to the threshold to produce a pass or
fail condition.
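The check above can be sketched in software as follows (a minimal model, not the HLS implementation; the window indexing and function names are assumptions made here for illustration):

```python
import math

def max_gradient_in_cross(win):
    """Sketch of the 'Keypoint and Gradient Check' computation on a 5x5
    window `win`, where win[2][2] is the target pixel.  Central-difference
    gradients dx, dy are taken at the target and its four immediate
    neighbours, and the maximum gradient magnitude sqrt(dx^2 + dy^2)
    over those five pixels is returned."""
    cross = [(2, 2), (2, 1), (2, 3), (1, 2), (3, 2)]   # (row, col) positions
    best = 0.0
    for r, c in cross:
        dx = win[r][c + 1] - win[r][c - 1]   # horizontal central difference
        dy = win[r + 1][c] - win[r - 1][c]   # vertical central difference
        best = max(best, math.hypot(dx, dy))
    return best

def passes_threshold(win, threshold):
    """Pass/fail condition against a gradient threshold."""
    return max_gradient_in_cross(win) >= threshold
```

All ten central differences stay within the 5x5 window, which is why the register window can serve every access without touching the cache.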
In Fig. 5.5, a cross pattern of reads is employed. For each maximum gradient calculation,
20 sets of subtractions are scheduled on an array of pipelined floating-point adder units;
the results are then squared, accumulated, and passed through a square-root hardware block. The whole
chain of operations is pipelined and allocated to a time-shared array of adders and multiply-
accumulators to match the desired processing rate, a process that will be discussed further in
the next subsection.
Based on the gradient threshold for the area and the pixel’s confidence rating from the metrics
stored in the variables Blacklisted and Validity score, the Keypoint's fitness as a mapping
candidate is calculated. It is then forwarded to the “Epipolar Line and 5-Point Unit” that is
responsible for calculating the epipolar line’s equation and its overlap with the camera frame
to determine the scan range, centre and steps, as in the discussion of Section 2.6, demonstrated
here in Fig. 5.6.
Figure 5.6: Epipolar geometry; the epipolar line is depicted in orange. While the point will lie on the line, it does not have to appear on the camera's frame, as it can lie outside that plane.
This unit is firstly characterised by more complicated control, since several operations will
be skipped or repeated depending on values only known at runtime, necessitating some
extra hardware allocated for the worst-case scenario. For example, an epipolar
line segment that is too short in the confidence interval will have to be “padded” on the fly
and if the starting point is outside it will be moved toward the border of the frame and the
whole line segment will be re-tested. It also contains several floating-point operations, including
multiplications and divisions, to calculate and check the robustness of the search: the
epipolar line angle, the portion inside the camera view and the length of that section. Finally,
this unit calculates the 5-point SSD pattern, as discussed in Section 2.6, that will be compared
later in the loop processing unit to find the best match along the epipolar line.
To optimise this for a hardware pipeline, some conditional execution paths were simplified with
some logic checks triggering various flag bits, propagated along the pipeline to reduce resource
usage. The one exception that adds complexity is a computation path where, if the epipolar
line has one side of it partially outside the frame, that point of the segment will be moved
inside the frame and the segment will have some checks repeated. This comes with dedicated
hardware resources, but was kept as part of the effort to maintain equivalence with the software
method in hardware. If all the checks are valid, at the end this information is forwarded to
the next section of the pipeline, together with a pre-calculation of the scan iterations and steps
and the five points to scan for. If it fails, the map point is still forwarded to be used for later
processing such as filtering, together with flags marking this decision and the reason for failure,
but it will not trigger a scan.
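The clipping of the epipolar segment against the frame can be illustrated with a standard Liang-Barsky-style clip. This is a stand-in for the unit's iterative "move the endpoint inside and re-test" behaviour rather than a reproduction of it, and the on-the-fly padding of too-short segments is omitted:

```python
def clip_segment_to_frame(p0, p1, width, height):
    """Illustrative clip of an epipolar line segment (p0 -> p1, in pixel
    coordinates) against the camera frame.  Returns the clipped endpoints,
    or None when the segment misses the frame entirely."""
    x0, y0 = p0
    x1, y1 = p1
    dx, dy = x1 - x0, y1 - y0
    t0, t1 = 0.0, 1.0          # parametric extent of the surviving segment
    for p, q in ((-dx, x0), (dx, width - 1 - x0), (-dy, y0), (dy, height - 1 - y0)):
        if p == 0:
            if q < 0:
                return None    # parallel to this border and outside it
        else:
            t = q / p
            if p < 0:
                t0 = max(t0, t)   # entering the frame
            else:
                t1 = min(t1, t)   # leaving the frame
    if t0 > t1:
        return None            # no overlap with the frame
    return (x0 + t0 * dx, y0 + t0 * dy), (x0 + t1 * dx, y0 + t1 * dy)
```

The surviving parametric range [t0, t1] directly gives the scan range and centre; dividing its length by the desired step size gives the number of scan steps.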
The next group of hardware blocks is a collection of units that will be referred to as the fast-rate
pipeline. This part of the proposed architecture aims to efficiently map the central part of the
algorithm, which contains a dynamic number of iterations per point, ranging from one to several tens,
to a buffered unit that will receive Keypoints at a rate matching the rest of the pipeline but
perform the necessary processing at a faster rate to account for the varied duration of these
dynamic iterations. In the fast-rate pipeline, the necessary cache accesses and error calculations
for the scanning of the epipolar line and the selection of the two best candidate locations are
performed, at a faster rate than the rest of the hardware. The thicker lines at the centre of the
units describe the streams for the information pertaining to this scan, shown in Fig. 5.2, while
the thinner lines correspond to the map point together with its metadata being forwarded,
coming from the units outside.
In this faster rate pipeline, the “Generate Scan Points” unit supplies a steady stream of pixel
locations to be fetched from the cache unit, according to the calculations in the Epipolar line
unit. This unit essentially acts as a bridge between the two rates of processing. One side of it
reads available target points (that are either ready to be processed, or turned out to fail some
condition at some step of the pre-processing). For the ones that will require a scan, part of
the information read is the start, finish and step size of the scan. Using this, a second pipeline
is initialised inside this unit for a predetermined number of iterations. It writes out a rapid
succession of coordinates for the points to be scanned, that are forwarded to the cache request
handler. Meanwhile the other metadata is also passed along a slower line through the Cache
Request and Subpixel Intensity Calculation units, through to the Loop Processing Unit that
has a dual pipeline inside as well.
Using the points coming from the previous unit, the “Cache Request Handler” fetches these
pixels from the caches and forwards them to the “Subpixel Intensity Calculation” unit where
linear interpolation is performed in a neighbourhood of 4 pixels around the floating point
coordinates. This separation of functionality is deliberate. The calculation of pixel coordinates
runs for a number of iterations that is unknown to the unit itself, arriving as just-in-time
information from previous units; by decoupling address generation from the actual memory
requests with buffers, each side can be optimised separately. This allows simpler, smaller
units that, as soon as their buffers start filling up, ramp up to a sustained access rate,
operating at high utilization and offering high performance
with a comparatively lower design complexity.
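The interpolation step can be sketched as standard bilinear interpolation over the 2x2 neighbourhood (an illustrative software model; the hardware performs this with fixed pipelined arithmetic):

```python
def subpixel_intensity(img, x, y):
    """Sketch of the 'Subpixel Intensity Calculation' step: bilinear
    interpolation over the 2x2 pixel neighbourhood around the floating
    point scan coordinates (x, y).  `img` is indexed img[row][col]; the
    coordinates must lie strictly inside the image so that all four
    neighbours exist."""
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0                       # fractional parts
    top = (1 - fx) * img[y0][x0]     + fx * img[y0][x0 + 1]
    bot = (1 - fx) * img[y0 + 1][x0] + fx * img[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bot
```

Each scan step needs exactly one such 2x2 access window, which is what allows the cache controller to serve one window per cycle as described later.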
All these streams are passed on to the “Loop Processing Unit” (LPU) that performs the core of
the scanning algorithm, the functionality of which is demonstrated in Fig. 5.7. On the left side
of the figure the successive scanning steps (overlapping sets of 5 comparisons) are illustrated
as white dots on the epipolar line. The matching error on the right side of the figure does
not have units, as the error is a relative value, produced from the sum of squared errors in
pixel intensity. The horizontal axis represents scan steps, the stride of which is relative to the
baseline and resolution for the scan.
The LPU block first reconstructs the pattern of 5 pixels that are scanned for, using flags to
identify the type of data it is receiving, as the previous units are intentionally agnostic of the
algorithm stage they are operating for. It then performs the scan steps to find the position
with the minimum sum of squared errors. It stores internally the best match and the second
best match. It also maintains additional information regarding the search. This includes the
steps performed, the distance of the search and the matching error. As we can see in Fig. 5.7,
in successful matches there is usually one strong minimum, with an error function that appears
relatively convex. To estimate whether a best-match location indicates a successful match or a false
positive, a number of metrics are checked, including the magnitude of the error and, if the two
best candidates are more than a step's width apart, whether their errors differ by more than a
certain factor (for example 1.5×).
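The selection logic can be modelled as follows. Only the best/second-best bookkeeping and the distinctiveness check follow the description above; the function name and the default threshold are illustrative assumptions:

```python
def scan_epipolar_line(errors, ratio=1.5):
    """Sketch of the Loop Processing Unit's selection logic: walk the
    per-step SSD errors, keep the best and second-best positions, then
    apply the distinctiveness check -- if the two best candidates are more
    than one step apart, their errors must differ by more than `ratio`,
    otherwise the match is rejected as ambiguous.
    Returns (best_index, best_error, is_valid)."""
    best_i = second_i = -1
    best_e = second_e = float("inf")
    for i, e in enumerate(errors):
        if e < best_e:
            second_i, second_e = best_i, best_e
            best_i, best_e = i, e
        elif e < second_e:
            second_i, second_e = i, e
    valid = True
    if second_i >= 0 and abs(best_i - second_i) > 1:   # not adjacent steps
        valid = second_e > ratio * best_e
    return best_i, best_e, valid
```

A second-best candidate adjacent to the best is expected (it samples the same minimum), so only distant runners-up trigger the rejection test.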
Figure 5.7: For each comparison, the previous four values are re-used and a new one is calculated by interpolating the values of the four pixels surrounding the floating point coordinates of the next scan point (vertical axis: matching error).
On a match considered successful, a linear interpolation step is added that attempts to further
optimise the solutions by interpolating between the two best discrete error positions (the func-
tion in the figure is not continuous but rather a set of samples). In other cases, especially in
larger baselines, repeating patterns or other sources of error (e.g. occluded parts of the scene)
will lead to multiple local minima and potentially false matches. For example a pattern such
as the one in Fig. 5.8, with its repeating texture, will lead to multiple local minima with com-
parable errors. The distance between the two best matches and the magnitude of the matching
error can be employed as discussed in the previous paragraph to potentially reject this obser-
vation as invalid. To improve the accuracy of the algorithm, different heuristics such as this
are employed to attempt to detect and remove certain sources of error, such as “blacklisting”
points if they repeatedly fail to be tracked successfully.
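One common closed form for such a refinement is a parabolic fit around the best sample. It is used here purely as a stand-in for the interpolation between the two best discrete error positions described above, since the exact scheme of the implementation is not reproduced:

```python
def refine_subpixel(errors, best_i):
    """Illustrative subpixel refinement around a successful match: fit a
    parabola to the error at the best step and its two neighbours and
    return the offset of the parabola's minimum, in units of scan steps,
    clamped to [-0.5, 0.5]."""
    if best_i <= 0 or best_i >= len(errors) - 1:
        return 0.0                       # no neighbours to interpolate with
    e_l, e_c, e_r = errors[best_i - 1], errors[best_i], errors[best_i + 1]
    denom = e_l - 2 * e_c + e_r          # curvature of the fitted parabola
    if denom <= 0:
        return 0.0                       # not locally convex: keep the sample
    offset = 0.5 * (e_l - e_r) / denom
    return max(-0.5, min(0.5, offset))
```

Because the error function is a set of samples rather than a continuous curve, this step recovers a fraction of a scan step of accuracy only when the minimum is well behaved, matching the convexity observation above.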
After a scan is completed, the result is forwarded to the “New Depth Calculation” unit, which
continues the processing of the mapped Keypoint at the slower rate of the rest of the pipeline.
This unit calculates a new depth and depth variance value based on the results of the LPU,
which the next unit, “Subpixel Stereo”, can further refine if the conditions are right for a
sub-pixel disparity search, as discussed in the previous paragraph.
Figure 5.8: In this case, intensity information, combined with a large baseline in the absence of a strong previous estimate, is insufficient to provide a good match (vertical axis: matching error).
Finally, this information is streamed to the “Depth Integration” and the Filter units. The first
is responsible for putting all the information together for each map point, integrating results
from the two previous units and updating the metadata of the point, described in the beginning
of the section, as necessary, including a revised confidence rating. The filter units perform
regularization operations. The first one, upon finding sufficient confidence in observations in
a window around a pixel that lacks an observation itself, can add an estimate for it with a
weighted average of its valid neighbours.
The second filter, “Depth regularize filter” calculates a smoothed value for the depth and
variance of valid map points, stored separately to the actual depth, again operating on a
sliding window around a centre pixel. For both filter units, row buffers allow region-of-interest
processing without losing the efficiency of the streaming interface, simply adding a one-time
latency to the total latency of the accelerator. In both cases the buffers are primed before
the first data point even arrives, and the total latency can be well estimated as the time
to receive 4 rows of the Keyframe, less than 1% of the total running time for a
vertical resolution of 480 pixel rows. After the processing and filtering finishes, the operations
performed at the input are reversed in a pack-and-output unit that streams the points of the
Keyframe out to the off-chip DRAM utilizing burst write transactions.
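A simplified software model of the first (hole-filling) filter is sketched below, with uniform weights standing in for the confidence-based weighting, and an assumed `min_support` threshold:

```python
def fill_holes(depth, valid, min_support=3):
    """Sketch of the first filter unit: for a pixel with no observation,
    if enough valid neighbours exist in the surrounding 3x3 window, add an
    estimate for it as the average of those neighbours.  `depth` and
    `valid` are row-major 2D lists; borders are left untouched."""
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if valid[y][x]:
                continue                       # already has an observation
            nbrs = [depth[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy or dx) and valid[y + dy][x + dx]]
            if len(nbrs) >= min_support:
                out[y][x] = sum(nbrs) / len(nbrs)
    return out
```

In hardware, the same 3x3 window is supplied by the row buffers mentioned above, so the filter consumes one pixel per cycle once its buffers are primed.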
5.3.3 Multi-rate dataflow operation
Semi-dense SLAM is characterised by a large amount of data that needs to be processed. For
a map at a resolution typical of the state of the art, 640 × 480, the depth map representation
will take up 7.37 MBytes. That is in addition to the actual frame size of 307 KBytes. To put
that into perspective, in order to process 60 frames per second as they come from a camera
and extract their depth information, the total time between captured frames is less than 17ms,
but that amount of data requires approximately half that time just to be read from memory
using a dedicated port on the reconfigurable fabric at the typical memory bandwidth available
on off-the-shelf FPGA-SoCs. To keep up with that latency, at a typical frequency achievable
on an HLS-designed FPGA accelerator of 100 MHz, it would be necessary to process one map
point every 6 cycles on average.
For devices of the Zynq family targeted throughout this thesis, the maximum bandwidth of the
memory controller is 4.2 GB/s, as presented in the Xilinx Zynq Technical Reference Manual [76],
but the real achievable bandwidth depends on the type of access (sequential
reads/writes are more efficient than random accesses) and is reported for this controller in the same
TRM to be in the region of 80-90% of that figure. On the FPGA side, there are 4 HP-ports, as discussed here
and in Chapter 2, that can handle a maximum of one 64-bit request per cycle, at a maximum
frequency of 150 MHz for a theoretical maximum of 1200MB/s per port. For a typical design
that can generate requests in two HP ports, at 100 MHz, the achievable bandwidth for sequential
reads/writes will be in the region of 1.6GB/s, which means an 8MByte sequential read will need
approximately 5 ms to complete.
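The figures above can be checked with simple arithmetic (all values taken from the text; this is a sanity check only, not a model of the design):

```python
# Back-of-the-envelope budget behind the figures above.
MB = 1e6
bandwidth = 1.6e9                           # two HP ports at 100 MHz, sequential access
read_time_s = 8 * MB / bandwidth            # the ~8 MByte sequential read
frame_budget_s = 1 / 60                     # 60 fps camera input
points = 640 * 480                          # map points per Keyframe
cycles_per_frame = 100e6 * frame_budget_s   # accelerator cycles at 100 MHz
cycles_per_point = cycles_per_frame / points

print(round(read_time_s * 1e3, 1))          # -> 5.0 (ms for the sequential read)
print(round(cycles_per_point, 1))           # -> 5.4 (avg cycle budget per map point)
```

The ~5.4-cycle budget per point is consistent with the "one map point every 6 cycles" figure quoted earlier, which rounds the budget up conservatively.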
A straightforward implementation would be to design units capable of performing a full epipolar
line scan in that time. However, the average computational cost for a scan is three times less
than the peak computational cost. Simultaneously, for most rows the majority of Keypoints
will not require an epipolar line scan. If a design targets a fixed latency of 6 cycles for the worst-
case load, it will result in an underutilized and less power-efficient accelerator with significantly
higher resource utilization.
Alternatively, certain properties of semi-dense SLAM can be leveraged to design a more efficient
solution. An epipolar line scan is often not required, when the point does not currently contain
a valid observation or is not visible in the current frame. Moreover, in confident observations,
it can be safely reduced to the region d ± 2σd, as described in Section 2.6 of the Background
Chapter. The designed coprocessor takes advantage of the pattern and frequency of the afore-
mentioned cases by utilizing fully pipelined units, each designed to efficiently execute a part
of the computation of the entire algorithm, as discussed in Section 5.3.2. The units are sepa-
rated by large buffers, so as not to propagate the variable processing delay backwards through
the pipelines. They are also designed with streaming dataflow in mind, allowing the design to
become rate-based and largely decoupled from the possible high processing latency of
a particularly busy neighbourhood of pixels.
The most efficient design was found to consist of self-contained, deeply-pipelined hardware blocks
that perform the different types of operations on a static schedule, re-using a number of compute
units to achieve a certain processing rate per cycle. The input and output of each of these pipelines
is a streaming buffer, which allows a unit to stall until the next data point is available.
At the same time, different parts of the algorithm can be overlapped in the same hardware
units, re-using some of the hardware for different operations, everything being designed with
the principle of data always moving forward. The pipelines contain multiple compute units for
multiplication, addition and division, and logic and multiplexers shift the structure of the unit
as necessary. This way they can change from an initialization phase, to operating on points,
to scanning across the frame cache, depending on the unit, or skipping a scan and forwarding
metadata to the next unit in the pipeline.
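The rate-decoupling idea can be illustrated with a toy cycle-level model: a fixed-rate producer, a variable-latency consumer, and a FIFO that absorbs the variance. All names, depths and intervals here are illustrative assumptions, not parameters of the implemented design:

```python
from collections import deque

def simulate(points, fifo_depth=64, slow_interval=5):
    """Toy cycle-level model of the multi-rate dataflow idea.

    A producer emits one point every `slow_interval` cycles; the fast-rate
    consumer spends a variable number of cycles per point (`points[i]` scan
    steps, minimum 1); a FIFO between them absorbs the variance.  Returns
    (total_cycles, max_fifo_occupancy)."""
    fifo = deque()
    produced = consumed = busy = 0
    cycle = max_occ = 0
    while consumed < len(points):
        # Producer side: one point per slow_interval cycles, if there is room.
        if produced < len(points) and cycle % slow_interval == 0 and len(fifo) < fifo_depth:
            fifo.append(points[produced])
            produced += 1
        # Consumer side: start the next point when idle, then burn one cycle.
        if busy == 0 and fifo:
            busy = max(1, fifo.popleft())
        if busy:
            busy -= 1
            if busy == 0:
                consumed += 1
        max_occ = max(max_occ, len(fifo))
        cycle += 1
    return cycle, max_occ
```

With uniformly cheap points the FIFO stays empty and the producer's rate dominates; a single expensive point only raises the buffer occupancy temporarily instead of stalling the upstream units, which is precisely the behaviour the large inter-unit buffers are there to provide.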
The units were also designed to operate at different rates, with fast-rate processing units in
the fast-rate pipeline to access the frame cache, perform epipolar scans and find the two best
matches, and more relaxed processing at the surrounding stages. The on-chip cache access hap-
pens at a buffered rate of one access window (2x2 pixels in the implemented design) per cycle,
allowing a simple and highly efficient cache controller. The units are connected to
each other through large streaming FIFO buffers that allow communication to happen asyn-
chronously, and hide penalties on latency that would arise from the variable processing rate
design. In this way this architecture achieves a higher performance level for the size of the
targeted reconfigurable devices, using resources more efficiently than a pipeline designed to
operate at a rate targeting the worst case scenario for the scanning latency.
As shown in Fig. 5.2, the units in the centre of the pipeline inside the orange box operate at
a faster rate than the rest. Focusing on the region of the fast-rate pipeline as demonstrated
in Fig. 5.9, there are three modes of operation for the units at the sides. When an epipolar
line scan and depth update is necessary, they perform initialisation steps and then set up the
rest of the control to perform a scan. In the second mode, they simply generate one scan step
per cycle. Finally, if there is a point that does not require a depth update, they just directly
forward that point’s associated metadata to the next unit in a single cycle. They essentially
act as a dual-rate interface between the flow of data through the rest of the algorithm and the
scan steps happening inside.
Figure 5.9: The units in the fast-rate pipeline operate at two rates simultaneously, with some control processes, initialization and communication with the rest of the pipeline happening only when starting or finishing a scan, and the main compute units operating at a rate of one scan step per cycle.
As such, part of the pipeline in the Generate Scan Points unit reads a set of values from the Epipolar
Line unit to initialise the counters and control for the rest of this unit. In a similar fashion, the
LPU main unit processes scan steps but needs to be aware of the parameters for the current
scan, indicating when to initialize a new scan or to integrate and forward the final results to the
New Depth Calculation unit. The metadata that guide this process pass along buffers every time
a new scan starts, indicated in Fig. 5.9 by the thinner arrows between the units, while the
higher-rate information dealing with the scan passes through separate buffers indicated with
the thicker lines in the figure. The units in the middle, Cache Request Handler and Subpixel
Intensity Calculation, do not have to be aware of the part of the process they are serving. This
is enabled by taking advantage of the symmetry of initialization accesses and scan accesses, so
that these units simply process data in the same way and pass them forward.
As has been mentioned, this architecture is scalable in terms of compute units allocated for the
hardware blocks, a process guided by the desired processing rate inside the fast-rate pipeline
and for the rest of the hardware. By reducing the amount of multiplier, divider and accumula-
tion units built in each hardware block and time sharing them more aggressively for different
operations, it is possible to increase the amount of cycles necessary for a scan but with an
almost linear decrease in resources for that unit. The most efficient designs must have the units
in the fast-rate pipeline matched with each other, as otherwise the slowest one would dictate
the rate of processing, leaving the rest of the units underutilized. In a similar fashion the rate
of processing for the units before the fast-rate pipeline should be tuned as one number, and
the same or slightly slower processing rate should be targeted for the units after the fast-rate
pipeline. Thus, these two processing rates are essentially two numbers that trade off targeted
performance against resources. The resulting architecture can therefore scale to different FPGA
devices and resource budgets. In Section 5.4 different example design points are presented,
which were achieved by changing the target processing rates as described previously.
5.3.4 Performance Analysis
To explore the hardware rates, as described in Section 5.3.3, that maximise the achievable
performance given the available hardware resources and verify that the design assumptions
discussed so far hold when running with real-world datasets, monitoring instrumentation was
added in the software version of LSD-SLAM and it was executed for the entire duration of the
datasets used in the evaluation section. Firstly, statistics were collected regarding the average
processing load that is expected for each iteration of a map update across all the mapped
frames. Then more detailed samples were acquired to study the distribution and extrema of
the amount of epipolar line scan steps per map point.
Figure 5.10: Heatmap of Depth Map valid points for epipolar scan. Axes represent image coordinates, with colour representing the frequency of a Keypoint in those coordinates requiring a scan
A subset of this data is visualised on the heatmap of Fig. 5.10. Its dimensions are equal to
the Keyframe image size used in the tested datasets and a typical LSD-SLAM implementation.
Each cell corresponds to a pixel of the Keyframe, and the colour indicates the frequency (0-1)
that that pixel would require an epipolar line search of any length for the duration of an entire
dataset. We can see that the frequency of use of any particular point rarely rises above the 30%
mark, and in most cases it is nearer to 5-15%. This figure is also interesting because it reveals
the tendency of the algorithm to focus on repeating patterns; in this case it is a particularly
busy texture that is tracked in different positions in different Keyframes. It also demonstrates
an intuitive fact, that points of interest will concentrate more on the middle three-quarters
of the Keyframe and less at the top and bottom. This is expected, as to repeatedly map a
point successfully it needs to appear in successive Keyframes, something which has a lower
probability of happening at the top and bottom borders of the camera, but can happen at the
left and right borders as the camera in the selected datasets pans more often left or right than
up or down.
Sampling for peak loads across a dataset revealed that the amount of points per line that require
scanning peaks around the centre of the image, at a frequency averaging 18% of the points in
a single row. By looking for extrema, some outlier cases were discovered, which however were
usually less than 1-2% of the frames processed. Those have to do with special cases consisting
of initialisation steps or very sharp motions. However, the worst case scenario will always
have an upper bound, found by assuming all buffers are full. That latency ceiling has a linear
relationship with the fast-rate pipeline processing rate and the processing load per frame, as in
that case the rest of the pipeline has to wait for every scan to finish before forwarding the next
point. Hence, this behaviour can be predicted and designed against.
The average epipolar line scan length was calculated at 11 steps for the tested datasets. Given
the results in Fig. 5.10, if 25% of the points in a line require an epipolar scan, and the other 75%
are skipped in one cycle, the total cycles per row would be 2240 cycles, or 3.5 cycles per point
for an average of 11 scan steps for the scan (fast-rate) pipeline. In the implemented unit, the
design choice was a processing rate of one scan/interpolation per cycle in the fast-rate pipeline,
matching the figures above, and a processing rate of one target point per 5 cycles in all the
other units, which targets an overall throughput of more than 62 frames per second for a resolution
of 640x480. This leaves a good margin of safety to absorb the latency of thousands of extra
scans above the typical values without dropping below the target framerate. It was found that
for the datasets tested, this relationship of 5-to-1 was a good ratio for the processing rates of
the pipelines with performance being very close to the theoretical maximum for the majority
of map updates.
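The quoted cycle count follows directly from the figures above (values taken from the text; arithmetic only):

```python
# Reproducing the per-row cycle estimate quoted above.
width = 640                 # points per Keyframe row
scan_fraction = 0.25        # fraction of points in a row needing an epipolar scan
avg_scan_steps = 11         # average scan length, at one step per cycle
skip_cycles = 1             # cycles to forward a point that needs no scan

cycles_per_row = (scan_fraction * width * avg_scan_steps
                  + (1 - scan_fraction) * width * skip_cycles)

print(int(cycles_per_row))        # -> 2240
print(cycles_per_row / width)     # -> 3.5 cycles per point on average
```

The 3.5-cycle average sits comfortably under the 5-cycle rate of the surrounding units, which is where the quoted margin of safety comes from.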
As we will discuss in the evaluation, in real-world testing the pipeline did indeed perform as
expected, with less than 1 update in a thousand deviating significantly from the performance
targeted. Moreover, in actual tests the software version on both tested platforms had a worse
behaviour in such cases with an increase of almost 200% in the processing time for some frames.
These results are discussed in more detail in Section 5.4. Thus, this work, besides delivering
the promised performance and performance-to-watt figures, delivers a much more predictable
performance which guarantees far fewer dropped frames and missed map updates than a software platform,
meaning a higher quality result for SLAM when operating in real-time, in real-world conditions.
Lastly, the proposed architecture is tunable and can be changed to adapt to different application
requirements. One option is to increase the capabilities of the fast-rate pipeline to have the
system guarantee a very small performance degradation, even in outlier cases, at the cost of
some underutilized resources. Alternatively, if the application allows, one can go the other way
and under-provision the fast-rate pipeline to target a more resource-and-power efficient system,
by allowing some degradation of a few percentage points in more cluttered scenes. In Fig. 5.11
of the Evaluation, we will see the scaling to target different performance points. The 32.5 fps
and 42.5 fps points are examples of design points where an extra cost in resources guarantees a
lower maximum latency, and therefore a higher target performance.
5.4 Evaluation
5.4.1 Experimental Setup
In this chapter, all of the experiments were conducted on the Xilinx ZC706 Evaluation board.
The board comes with a Zynq-7045 FPGA-SoC [76] and two DDR memories of 1 GB of
DRAM each, of which only one is accessible to both the FPGA and the mobile CPU and is
the one used in these experiments. The SoC contains a dual-core ARM Cortex-A9 clocked at
667 MHz and buses connecting the CPU to off-chip memory, and directly to the FPGA fabric.
On board, as is the case for the experiments conducted in Chapter 4, the mobile dual-core
CPU runs an Ubuntu distribution of the Linux Operating System. LSD-SLAM was compiled
and tested as software, and then the core functionality of the tracking and mapping tasks was
replaced with calls to the accelerator unit through userspace drivers that were designed as part
of this work, using a direct slave interface from FPGA to CPU set up through the Xilinx Vivado
toolchain.
5.4.2 Benchmark selection and Platforms
The Benchmarks used for the design exploration and the performance comparison of this work
were the Room and Machine Hall trajectories supplied by LSD-SLAM’s authors on TUM’s
website: https://vision.in.tum.de/research/vslam/lsdslam. These were run on three
platforms, in a non-real-time setup where the tracking and mapping tasks were run separately,
on a step-by-step basis, once for each frame, to maximise the software performance on general
purpose hardware and to avoid affecting the experimental results by effects other than the
algorithm's and hardware's performance. The first general purpose hardware platform is a high-
end desktop machine, with 16GB of DDR3 RAM and an Intel Core-i7 4770 CPU with Turbo
mode enabled running at a sustained 3.77 GHz with all cores loaded executing the software. The
second one, to provide a state-of-the-art comparison point for mobile/embedded platforms, was
an Nvidia Tegra X1 board, utilizing the on-board ARM Cortex-A57 CPU clocked at 1.73 GHz
alongside 4GB of onboard RAM.
5.4.3 Design Implementation and Resource Usage
The architecture described in the previous section is designed to be platform agnostic and
optimised for resource usage. Nevertheless, the use of Vivado HLS tools drove a number of
implementation decisions in order to develop and test the IP on the target FPGA-SoC, leading
to certain overheads1. For evaluation, the design was synthesized and placed-and-routed with
Vivado HLS and Vivado Design Suite (v[2018.2]), targeting a Xilinx Zynq ZC706 board and
run and tested on the same board. For the parameters described in Section 5.3, timing was
met for the coprocessor at 125 MHz. The resource usage for that result, post-implementation,
1For example, since the tool always rounds up memory size to the next power of two for BRAM utilization, a choice was made to partition in two dimensions cyclically by a factor of 5, a non-power-of-2 factor. This significantly reduced the memory overheads, eventually fitting both accelerators on the same device in the highest performance configuration, at the cost of increased DSP and LUT usage.
Resource   Map Unit   Track and Map Units   Available on Z-7045
LUT        151,674    184,993               218,600
LUTRAM     12,242     15,317                70,400
FF         213,761    256,665               437,200
BRAM       958        1089                  1090
DSP        594        718                   900

Table 5.1: Resources post-implementation
is described in Table 5.1. This was also combined with the design from [19] and both units
were successfully tested working side-by-side, setting the target frequency to 100 MHz.
Fig. 5.11 demonstrates the resource to performance scaling, post-implementation, for the pre-
sented design. The graph contains some examples ranging from a lower performance point up
to the 60 fps design point which was selected as the best performance candidate to allow a
second accelerator to fit in the larger ZC706 board. To raise the performance from 30 fps to 60
fps the main change is the increase of processing units, doubling the amount of Keypoints we
can process per cycle. As we have discussed in Section 2.4, under computational cost, this task
has a complexity of O(n) scaling with the number of Keypoints processed. So having tuned the
rest of the parameters (gradient threshold, resolution) for a good accuracy and robustness level
for our dataset, there is an average number of Keypoints to process, and doubling the number
of Keypoints processed per unit of time will give close to double the performance.
It should be noted for Fig. 5.11 that, while scaling from 15 to 30 to 60 fps is obtained
by increasing processing units until we can process on average twice the Keypoints per second,
the 32.5 and 42.5 design points reflect a different goal. Due to the nature of the mapping
function, as we have discussed in previous sections, we allow some increase in latency to achieve
a more optimal resource-to-framerate ratio. These two design points instead increase several
intermediate resources to guarantee a better worst-case behaviour. This requires a significant
increase in processing units and registers but offers only slightly better performance for the
average case.
Figure 5.11: Resource scaling with architectural tuning targeting 100 MHz (horizontal axis: target performance in frames/sec, for the 15, 30, 32.5, 42.5 and 60 fps design points; vertical axis: resource ratio compared to the Zynq-7045, for LUTs, FFs and DSPs)
We can see in the figure that different resources scale at different rates, owing to the fact
that the scaling so far targets only compute unit placement, guided by the integer initiation
interval in High-Level Synthesis, while for now the interfaces to memory, cache sizes and lines,
and inter-block communication are assumed static. As such, some resources such as the DSPs,
which are ubiquitous in most compute units, scale almost linearly with the target performance,
followed by the LUTs. On the other hand, Flip-Flops have a standard offset cost owing to
their extensive use in I/O and other fixed registers, and their change in utilization reflects
more relaxed pipelining when targeting lower processing rates.
5.4.4 Performance and Power Comparison
In real-world testing with two datasets from TUM (Room and Machine Hall, supplied by the
authors of LSD-SLAM on TUM's website: https://vision.in.tum.de/research/vslam/lsdslam),
98% of frames were within one millisecond of the target processing time and more than 99%
were within two. The accelerator achieved a performance of more than 60 frames per second,
on-par with a hand-optimised, multi-threaded implementation on a high-end desktop CPU.
There were some
outlier cases with a performance drop of up to 30%, from 16.3 ms to 21 ms. For example, in
the Machine Hall dataset, one of the two depicted in this evaluation, out of 3268 map updates
only 19 experienced a significant delay, many of them in the initialisation phase, with a
maximum recorded value of 20.9 ms. However, this is considered acceptable in this application
for two reasons. Firstly, even in the case of accumulating a delay over successive updates, the
application can dismiss one dropped map update out of one thousand without noticeable
degradation, and can handle a lower mapping rate than the one targeted by the presented
design.
Secondly, as has been mentioned, the accelerator delivers far more stable performance than the
software versions. In Fig. 5.12, we can see a violin plot of the mapping performance (total
processing time for a map update step) on three high-end platforms across the two datasets.
In this plot, the colour corresponds to the platform: from left to right, an Intel i7-4770 at
3.77 GHz, this work at 100 MHz on a Zynq-7045, and the Cortex-A57 on a Tegra TX1 running
at 1.73 GHz. The width of the shape corresponds to the density of observations around a
particular time value, similar to a sideways kernel density plot. The thicker, white line in the
middle corresponds to the mean value of the observations, while the thinner orange one to the
median. Finally, the lines at the top and bottom are the actual minimum and maximum values
observed.
The figure demonstrates the variability of this processing load on general purpose hardware,
and how robust the proposed mapping accelerator is to these delays, appearing almost flat since
most of the observations were very close to the ideal value of approximately 16.2 ms at 100 MHz.
The reasons for this come down to two main aspects of computation and communication: how
much of the available bandwidth each platform achieves in practice, and how memory access
is overlapped with computation. As this is a memory-intensive application, cache misses and
other memory inefficiencies will cause significant delays on a general-purpose CPU platform,
especially with multiple threads running at the same time and polluting the cache for each
other. This effect, as can be verified from the test data in Fig. 5.12, is exaggerated on the
mobile platform, with its smaller and simpler cache memories and predictors and more limited
out-of-order execution capabilities.
[Violin plot: one distribution per platform/dataset pair — Room and Machine Hall on the PC, on this work, and on the Tegra TX1; y-axis — map update time in microseconds (0–250,000).]
Figure 5.12: Mapping performance in microseconds on different platforms and datasets
In contrast, the hardware architecture presented in this chapter is designed specifically for this
application and takes advantage of the high bandwidth memory interface available, scheduling
reads at the best efficiency possible and hiding most memory latencies, while also caching the
randomly-accessed data in advance. It also overlaps computation and communication very
effectively and though operating at a lower frequency than both CPUs, it offers orders of
magnitude more compute units than a general purpose processor, performing significantly more
operations per cycle. In this way, its performance does not depend strictly on the number of
points to process, unless their concentration and number significantly exceeds what is expected.
On the CPU, performance varies considerably on a frame-by-frame basis. Simple cases, where
there are few points to perform scans for and where those points are concentrated in one area,
may give better performance simply through fewer net CPU instructions and fewer cache misses.
On the other hand, cases with many points will increase CPU instructions linearly, and their
spread may tax the memory subsystem more, leading to a large increase in computation time.
In addition to the performance, power consumption samples were collected for each platform,
[Bar chart: one bar per platform — i7-4770 @ 3.77 GHz, Tegra TX1 (Cortex-A57 @ 1.73 GHz), this work @ 100 MHz, this work @ 50 MHz; y-axis — power consumption during execution measured at the wall, in watts (0–110), split into static/idle board power and dynamic power at runtime.]
Figure 5.13: Power consumption of the devices tested. Here, “This work” refers to the combined power of the tracking and mapping accelerators and the CPU operating for the background tasks of SLAM.
using the current draw from the socket, so it includes all the peripherals plus the power supply
losses. The power results are collected while executing the full SLAM algorithm, so in the case
of the FPGA-SoC include both the mapping accelerator and a version of the accelerator for
tracking from Chapter 4. The results are presented separating static from dynamic power
consumption, to make clearer how much the chips contribute at full load, versus board losses
and static power, where unused peripherals are also a significant contributor. The measurement
is accurate to ±0.5 W. In the case of the presented accelerator, approximately 0.98 W of the
static power draw can be attributed to the FPGA itself, according to the Xilinx estimator in
Vivado. Testing power draw with an unprogrammed FPGA showed a decrease in static power
of approximately 1-2 W, supporting this estimate. A performance on par with the high-end
desktop CPU is measured, but for an order of magnitude less power for the FPGA fabric to
function.
We can see that the FPGA development board is at a similar power level at full load to the
Tegra TX1, but with more than a 4x increase in performance on average for the presented
accelerator design. The estimate for the chip power of the mobile CPU + FPGA fabric on the
Zynq-7045 is 6.5 watts using the Xilinx tools on Vivado post-implementation. Combined with
the results shown on Fig. 5.13, we can estimate the portion of the power requirements that are
due to the power supply and unused board peripherals. Static power figures are high for the
FPGA board since it includes several unused devices, such as SPI flash and a second, unused
DDR chip that is only accessible from the programmable logic. On the Tegra, the GPU was
set to run at idle clocks so that the power should reflect mainly the CPU's consumption, with
the added requirements of off-chip DRAM and power supplies on top of the static draw from
the GPU resources at idle voltage/frequency levels.
The aim of the accelerator, together with the work in Chapter 4, is to provide a complete
acceleration solution for LSD-SLAM, a state-of-the-art semi-dense SLAM method. The two
designed architectures both achieve real-time performance, evaluated running LSD-SLAM with
a pre-recorded dataset utilizing the two accelerators, with a board power draw at the wall
of approximately 15 W and an estimated chip power consumption of 6.5 W. So far we have
compared the performance of the accelerator to that of the software implementation executing
on an embedded and on a desktop-grade CPU. Table 5.2 presents some representative examples
of the current state of the art, both in SLAM algorithms and in typical embedded solutions,
and is a reduced version of the Table at the end of Chapter 2.
Work | Type | Hardware Plat. | Density | Close-loop | Inertial | Typical Power
ORB-SLAM [41] | SLAM | Laptop CPU | Sparse | X | | 38-47 W
LSD-SLAM [17] | SLAM | Laptop CPU | Semi-dense | X | | 40-50 W
Whelan et al. (RGB-D SLAM) [36] | SLAM | GPU Accelerated | Dense | X | | 170-250 W
This work | SLAM | FPGA SoC | Semi-dense | X | | 6-7 W
Leutenegger et al. [45] | SLAM | Laptop CPU | Sparse | X | X | 30-50 W
SVO [57] | Odometry | Laptop/Jetson-TX1 | Sparse | | | 30-40 W / 10-15 W
FPGA Acc. of ORB Extraction [61] | Kernel Acc. | FPGA | Sparse | | | 5.3 W
Navion [60] | Odometry | ASIC - 65 nm CMOS | Sparse | | X | 2-24 mW

Table 5.2: State-of-the-art SLAM examples. Compiled with a focus on features and characteristics of different solutions to demonstrate the breadth of the field. This is a simpler version of the Table in Chapter 2.
The table is not meant to be exhaustive or to rank the works. Instead, it was compiled to focus
on the characteristics of different solutions and provide an overview of different software and
hardware approaches to SLAM and their power characteristics.3 The key takeaway is the gap
between fast but sparse odometry, with no large-scale capabilities or loop-closure, on embedded
systems, and accurate, complex and dense solutions occupying different positions on the SLAM
landscape but requiring high-end hardware for real-time operation.
The second column indicates whether the work attempts the whole task of tracking and mapping
with a large-scale map, simply tracks a trajectory (Odometry), or is an accelerator of a specific
kernel or operation meant to operate alongside general-purpose hardware running a SLAM or
Odometry algorithm. The next column indicates which hardware platform was targeted: the
first three works are purely algorithmic, targeting different map density outputs; the two after
the work presented in this thesis specifically target lower-power platforms; and the last two
rows are custom hardware versions. The following columns present the qualitative sparsity
level of each work, whether it has large-scale capabilities and the ability to close loops (making
a large-scale map coherent), and whether the work includes an inertial sensor as an input.
The final column is the typical power of the platform which, unless mentioned in the paper,
is estimated based on the hardware used under typical loads, as discussed in the footnote. Our
work, as we discuss in the next section, stands to provide a complete solution to bridge the gap
between the latest research in advanced SLAM algorithms, such as the three examples above
it in the table, and the existing work in embedded SLAM, such as the two below it. With the
two presented accelerators, the achieved performance-per-watt and latency can bring full semi-
dense direct SLAM capabilities, at high performance, to the power envelope of an embedded
low-power device.
3The power figures were often not mentioned in the original works, or were measured with varying methods. Thus, in the interest of providing a qualitative view, a typical expected power is included for the chip/platform mentioned in the publications (e.g. nVidia 680GTX, Jetson TX1, Intel i7-4700MQ etc.). For the presented work, the estimated chip power is reported instead of the board power to be in line with other papers.
5.5 Conclusions
This chapter presents an FPGA-based architecture that achieves the performance required to
run high-quality, state-of-the-art semi-dense SLAM with high-end desktop performance at the
power level of an embedded device. It has good scalability and is parametrised to address
various SLAM specifications and target different FPGA-SoC devices, as demonstrated by
successfully running alongside the accelerator presented in Chapter 4. Combined with the work
presented in Chapter 4, the accelerator presented here finally closes the loop of SLAM, providing
high performance and low latency in a low-power package for both of the interdependent
real-time tasks constituting real-time, semi-dense SLAM.
The main findings of this work are that the most efficient designs for the target application
combine a high-bandwidth streaming interface to common memory with local caching of the
region of interest or, where possible, the entire image frame processed, tailored to the processing
that needs to be performed. For the complex control flow of these algorithms, multi-rate,
multi-modal units separated by buffers were found to be a great fit. The most efficient and
high-performance choice was also found to be a pipeline design that follows the dataflow
paradigm, aiming to move every data point through the pipeline only once. For a highly
efficient design, memory accesses were separated from the rest of the computation, with a
stream of data flowing through the hardware carrying its control parameters as metadata along
the processing path.
5.5.1 Achievements of Thesis
In the Background and, in more depth, in Chapter 3, we discuss the principles of semi-dense
SLAM, why it needs high-performance operation and what needs to be targeted for acceleration
towards achieving that in an embedded, power-constrained environment. A state-of-the-art
semi-dense SLAM algorithm combines characteristics that make the performance-per-watt
and features of general-purpose hardware insufficient. This research produced two specialised,
custom hardware architectures that, implemented on an FPGA-SoC, have achieved real-time
tracking
and mapping performance at an order-of-magnitude better performance per watt than general
purpose CPUs.
At the beginning of this research, the state of the art in embedded SLAM consisted of sparse,
feature-based algorithms, often with reduced capabilities even compared to state-of-the-art
sparse SLAM such as [41], and usually constrained to simple trajectory estimation in odometry
applications. In contrast, the accelerators presented here, while targeting the same power
constraints, have provided a high level of performance with no compromise in the quality of the
result for the two real-time components of state-of-the-art semi-dense SLAM. This work targets
a complete SLAM algorithm, with loop-closure, high-density reconstruction and large-scale
optimisation, and stands to enable significantly richer and more advanced SLAM for embedded
platforms.
This research makes an important step on the way to realising many emerging applications in
robotics and augmented/virtual reality such as those discussed in Chapter 1. It was done with
the goal of offering the performance and power-efficiency necessary for a semi-dense SLAM to
operate in real-time but in a platform that can target different projects and devices. Meanwhile,
the lessons that emerged from it can be used to guide future hardware design in this space
towards embedded environment understanding, not limited to reconfigurable hardware but
offering a direction to guide domain-specific ASICs and perhaps even targeted optimisations
for general-purpose embedded hardware to enable a better performance-per-watt.
Moreover, because it was based on the idea of combining specialised hardware tightly inte-
grated with a mobile CPU running a full operating system, it can provide a base platform
to add capabilities to and extend for many different research projects. It can act as a self-
contained architecture, and therefore help close the gap between the FPGA community and
the algorithms/robotics communities, by making it easier to reuse this work even for researchers
without previous experience with reconfigurable platforms.
Chapter 6
Conclusions and Future Work
This thesis has so far explored the diverse field of Simultaneous Localisation and Mapping,
discussed the gap between the capabilities of embedded SLAM and the state of the art in
SLAM algorithm research, and presented two high-performance architectures that bridge this
gap in the context of direct, semi-dense SLAM with power-efficient custom hardware
acceleration. This was done with the aim of giving developing platforms, such as autonomous
quadcopters, ground robots and augmented reality devices, a greater degree of environment
awareness, while overcoming the low-power constraints they often come with.
The work presented in this thesis has addressed both parts of the research question, stated in
Section 1.4. On one side, it demonstrated successfully that it is possible to design a custom
hardware architecture to achieve advanced real-time SLAM in power constrained embedded
platforms. On the other, towards the design of such an architecture, a number of novel archi-
tectural and micro-architectural features and components have been presented as part of two
custom architectures that achieve the desired performance and power requirements. So far we
have discussed the main features and requirements for designing such an architecture. In the
following two sections we will discuss the main lessons drawn from this work, how they can be
generalised to apply to similar problems, and the proposed research directions for future work.
6.1 Lessons learnt designing with HLS and FPGA-SoCs
This section will present lessons drawn from the experience of designing custom hardware archi-
tectures for this type of application, as well as from exploring different designs and implementing
them using High-level synthesis and FPGA Systems-on-Chip.
Rapid design exploration and implementation with HLS
While the use of C/C++-to-RTL synthesis tools, such as those used in this work, can
significantly shorten implementation time, giving the benefits we discussed in Chapters 1 and 2,
it comes with two significant constraints.
1. Scheduling of operations, pipeline stages and processing units is automatically
inferred. This means that this part of the design is partially out of the hands of the designer.
Hence, in the cases where this leads to a series of inefficient design choices, it is not
straightforward to amend. Organising HLS code in short loops, separated by interfaces that
the tool will not automatically flatten, can partially overcome the problems this causes. One
way to achieve this is using streaming FIFOs, or hierarchy levels with flattening explicitly
turned off.
This creates a modular coding style and leads to the generation of RTL that is easier to verify.
At the same time, the tool works with a smaller scheduling and state space, improving its
chances of identifying an efficient solution. These factors make it more likely to identify
problematic C/C++ that infers less-than-optimal hardware, and easier to refactor that code to
end up with the desired RTL. In this context, dataflow architectures proved very useful for
efficiently combining this design principle with data-intensive algorithms such as SLAM.
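The coding style described above can be sketched as follows. This is a hedged illustration only: std::queue stands in for an HLS streaming FIFO such as hls::stream, and the stage names and threshold logic are illustrative rather than the thesis's actual processing units.

```cpp
#include <cstdint>
#include <cstdlib>
#include <queue>
#include <vector>

// Stand-in for a streaming FIFO (e.g. hls::stream<T> in an HLS flow).
template <typename T> using Fifo = std::queue<T>;

// Stage 1: a short, single-purpose loop that the tool can schedule in
// isolation; computes an absolute horizontal gradient per pixel.
void gradient_stage(const std::vector<std::uint8_t>& row, Fifo<int>& out) {
    for (std::size_t i = 1; i < row.size(); ++i)
        out.push(std::abs(static_cast<int>(row[i]) -
                          static_cast<int>(row[i - 1])));
}

// Stage 2: consumes the stream and keeps values above a threshold.
// The FIFO boundary between the stages is what keeps an HLS tool from
// flattening the two loops into one large scheduling problem.
void threshold_stage(Fifo<int>& in, Fifo<int>& out, int threshold) {
    while (!in.empty()) {
        int g = in.front();
        in.pop();
        if (g >= threshold) out.push(g);
    }
}
```

Each stage is small enough to verify on its own, and new stages can be chained by adding another FIFO, mirroring the modular, dataflow-style structure discussed above.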
2. Tools assume a one-cycle latency for any interface access, even if it is connected
to a busy interconnect and through to an off-chip DRAM. This means that tool reports, which
estimate the latency and performance of the RTL, may miscalculate by one to two orders of
magnitude depending on the underlying architecture, and that the tool may generate highly
inefficient memory
interfaces.
This can be overcome by using dedicated for-loops in HLS that perform sufficiently large
sequences of serial accesses from memory to local buffers, independently of the rest of the RTL.
These can be utilized to generate more efficient types of accesses to a memory interface, such as
burst transactions of a desired length and width. Such a loop should be packaged in a dedicated
I/O unit and paired with a sufficiently large buffer to hold the elements that are read, or are
ready to be written out. Anticipating the algorithm's data needs is essential to use such a
memory interface efficiently. Moreover, knowledge of the interconnect and I/O of the chip
should be taken into consideration to optimise memory accesses for a specific SoC, tuning word
size and burst size for optimal results.
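A minimal sketch of such a dedicated I/O loop is shown below; the burst length, pointer types and the pragma mentioned in the comment are illustrative assumptions, not the exact interface of the presented design.

```cpp
#include <cstddef>
#include <cstdint>

// Dedicated I/O unit: one tight loop reads a long, contiguous run of
// words from the external memory pointer into a local buffer (on-chip
// BRAM in hardware). Kept separate from the computation, an HLS tool
// can infer a single burst transaction for it, e.g. by pipelining the
// loop with an initiation interval of 1. BURST_LEN is an illustrative
// choice of 256 words per transaction.
constexpr std::size_t BURST_LEN = 256;

void burst_read(const std::uint32_t* ddr, std::size_t offset,
                std::uint32_t* local_buf) {
    for (std::size_t i = 0; i < BURST_LEN; ++i)
        local_buf[i] = ddr[offset + i];
}
```

The computation then works only against local_buf, so its schedule is independent of the real latency of the interconnect and DRAM.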
Debug and verification of HLS-generated RTL
HLS tools have a big advantage in allowing the C/C++ code to be verified before it goes
through a "Synthesis" step to be converted to Verilog or VHDL. The code can be compiled
with a software compiler, including debug symbols, and run on the development computer in
a debugger, with the advantages of state visibility, breakpoints etc. However, in this author's
experience, there are significant challenges in verifying the underlying HLS-generated RTL.
While an RTL simulation was included in the tool, it would not work for any non-trivial
design, instead crashing or freezing. In both cases, early cancellation or any other error would
result in no output of waveforms or other information, leaving no way of knowing what the
cause was. Meanwhile, valid HLS/C code, with behaviour identical to the software when
simulated as software, would often successfully synthesize to RTL that would then freeze or
corrupt its output when run on the FPGA with real data. Moreover, as the accelerators were
designed to operate in the context of an SoC with a full operating system and hardware/software
co-processing, the RTL under development could only be tested extensively when running in
that context, making bare-metal development tools unsuitable.
While the above experience can be attributed partially to the immaturity of the tools, it is
not expected to change significantly in the near future; it is therefore worth discussing
alternative solutions to support current efforts. A useful practice to facilitate the debugging
of what, because of the above issues, essentially becomes a black box is the following: during
the development of a design, one should include extra "debug" ports in the RTL to enable
in-system debug, an idea also used to debug ASICs in silicon. A number of slave-type ports
can be used to expose the current value of key internal registers, to be read when the hardware
freezes or periodically. In addition, a master-type port can be connected from the RTL to the
off-chip memory, to export internal state and the entire contents of the caches when triggered,
or at pre-determined intervals.
Using the two types of ports described above, along with careful selection of the state to export,
can assist towards the identification of potential issues not visible when run under a debugger,
or with limited testbenches. A useful side-effect is that the same ports and debug tools can be
used while running with real-world data in lock-step with a software solution, in order to check
the numerical accuracy of the hardware more extensively at different stages of the algorithm.
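As a sketch of the slave-port idea, the register map below mirrors a few pieces of internal state. The register names and the sentinel value are hypothetical, and in hardware debug_read would sit behind a memory-mapped slave interface (e.g. AXI-Lite) rather than a C++ function.

```cpp
#include <cstdint>

// Hypothetical debug register block exposed over a slave-type port, so
// the CPU can inspect internal state when the accelerator freezes.
struct DebugRegs {
    std::uint32_t fsm_state;       // current state of the main control FSM
    std::uint32_t frames_done;     // frames fully processed so far
    std::uint32_t fifo_occupancy;  // fill level of an inter-block FIFO
    std::uint32_t last_axi_addr;   // last address issued on the memory port
};

// CPU-side read of one register by word offset; unmapped offsets return
// a recognisable sentinel, a common in-system debug idiom.
std::uint32_t debug_read(const DebugRegs& r, unsigned word_offset) {
    switch (word_offset) {
        case 0: return r.fsm_state;
        case 1: return r.frames_done;
        case 2: return r.fifo_occupancy;
        case 3: return r.last_axi_addr;
        default: return 0xDEADBEEFu;
    }
}
```

Reading these registers periodically, or after a freeze, narrows down which unit stalled and on which transaction, without any waveform visibility.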
Good development practices
Finally, some other good practices that will assist future work using HLS and FPGA-SoCs
are included here. The ideal scenario for hardware/software cooperation includes a system
interrupt that can be linked to the RTL. This way, the CPU has the maximum number of
cycles free for other useful computation, instead of acting as a controller or polling the hardware.
Due to the significant latency of the various forms of communication between CPU and FPGA,
tasks that take under a few thousand CPU cycles and need HW/SW synchronisation are not
good candidates for acceleration, unless batch processing is possible (i.e. performing multiple
tasks in parallel). Memory coherency can become a significant issue for HW/SW cooperation
under an operating system and needs attention in such architectures.
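The rule of thumb on task size can be captured in a simple break-even model. This is a sketch; the function and any numbers used with it are illustrative, not measurements from the platform.

```cpp
// Break-even model for offloading: with a fixed HW/SW synchronisation
// cost per round trip, a task is worth offloading only when the cycles
// saved by the hardware speedup exceed that cost; batching several
// tasks amortises the round trip. All parameters are illustrative.
bool worth_offloading(double task_cycles, double sync_overhead_cycles,
                      double hw_speedup, unsigned batch) {
    double sw_cycles = task_cycles * batch;                       // run on CPU
    double hw_cycles = sync_overhead_cycles + sw_cycles / hw_speedup; // offload
    return hw_cycles < sw_cycles;
}
```

For example, under this model a 1,000-cycle task with a 5,000-cycle round-trip cost is not worth offloading on its own even at a 10x hardware speedup, but a batch of 100 such tasks is.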
6.2 Generalisation of the presented research
Main application parameters
As discussed in section 2.4 and throughout this thesis, the performance of SLAM algorithms
is affected by key parameters, where it is possible to increase performance at the expense
of quality, or vice-versa. In the context of the algorithms we have discussed here the main
parameters are tracking and mapping resolution, point selection thresholds (such as maximum
gradient, validity score), and the tuning of noise and damping parameters in the optimisation
process of tracking and regularisation parameters used in the mapping function. While different
algorithms might have slightly different parameters to tune, computer vision and SLAM are
probabilistic algorithms dealing with noisy and imprecise data and attempting to recreate
a finite model of a world with infinite detail. As such, there will always be a tradeoff between
computation, density and quality on the one hand and performance on the other, when
optimising an algorithm for a specific application and implementation.
The tuning of these parameters, while crucial in SLAM, is orthogonal to the work presented
in this thesis. The presented architectures are efficient for a large range of resolutions and
algorithm parameters, meaning that any improvement due to parameter tuning will be valid
for both the software and the hardware version of the functions involved. As such, the best
solutions will come from combining this research with optimising the algorithmic parameters
for specific environments and applications.
Scaling custom architectures with different algorithmic parameters
The above parameters will not affect efficiency, but will affect performance indirectly in two
ways. Firstly, most of them influence the total number of Keypoints to process, either through
resolution changes or through selection criteria. Secondly, better accuracy and quality will
often lead to faster convergence, meaning fewer optimisation steps and hence less computation
for tracking. In practice, however, this will not offset the upfront increase in computational
load required to obtain that level of accuracy.
However, by carefully balancing the specific requirements of an application, that tuning can
be combined with a reduction in the mapping or tracking resolution used. This might not
degrade accuracy and quality to unacceptable levels for that application, but will require less
memory for caching and will bring an almost proportional reduction in logic needed to achieve
similar performance levels for both the tracking and the mapping task.
Bottlenecks to arbitrarily scaling this architecture on an FPGA-SoC
Scaling to a significantly larger FPGA would provide the architectures presented here with more
processing units to utilize to significantly accelerate all parallelizable computation tasks, but
would face two main bottlenecks. The first has to do with the caches used in this architecture,
while the second is related to off-chip bandwidth.
The first problem is that both tracking and mapping access random points in a frame cache.
After sufficiently improving the computational performance, the algorithm will need to access
multiple random windows in these caches in a single cycle. In a straightforward implementation,
that would require scaling to multi-port RAMs, or even a possible duplication of caches. By
taking advantage of the application's requirements, for example that these caches are written
once and read multiple times, other solutions with multi-level caches and more complicated
control for sharing the same resource could be leveraged. While there are many potential
solutions to consider that would eventually push that bottleneck further, the fact is that the
cache designs we have so far considered will need to be replaced with caches of higher
complexity, area and power in order to provide sufficient bandwidth.
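One straightforward (and area-hungry) instance of the duplication option above can be sketched as follows. The class and its parameters are illustrative assumptions that rely on the write-once/read-many access pattern just described, not a component of the presented design.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Replicated frame cache: because the frame is written once and then
// only read, the contents can be copied into one bank per processing
// unit at fill time, so every port serves an independent random read
// each cycle without arbitration. The cost is on-chip memory usage
// proportional to the number of ports, i.e. the area/power growth
// discussed above.
class ReplicatedFrameCache {
public:
    ReplicatedFrameCache(const std::vector<std::uint8_t>& frame,
                         unsigned ports)
        : banks_(ports, frame) {}  // write once: each bank gets a copy

    std::uint8_t read(unsigned port, std::size_t addr) const {
        return banks_[port][addr];  // each port reads only its own bank
    }

private:
    std::vector<std::vector<std::uint8_t>> banks_;
};
```

In hardware, each bank would map to its own BRAM so the reads genuinely happen in parallel; the multi-level shared-cache alternatives mentioned above trade this duplication for more complex control.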
The second is a problem shared by many current high-performance hardware architectures,
such as high-end general-purpose CPUs and GPUs: it is easier to add more compute in a
centralised location than to feed it with the data it needs from another location on- or off-chip.
As the processing capabilities of the accelerators we discussed scale, and caches get larger to
fit larger amounts of data, the resulting hardware demands higher bandwidth from off-chip
sources. However, the latency and bandwidth of memory interfaces to off-chip DRAM, and of
the DRAM chips themselves, have improved more slowly than the available compute on-chip, a
trend that could continue in the future. As such, off-chip bandwidth may become a significant
bottleneck and have to be addressed by alternative solutions in the future.
On designing future systems
Drawing on the lessons learnt during the past few years' research on this topic, the main focus
should firstly be on memory bandwidth, both on- and off-chip. As we discussed in the previous
paragraphs, memory bandwidth and cache design will become a crucial factor early when
scaling up. Considering the experience of designing and implementing the presented
architectures, a well-designed interconnect on an FPGA-SoC should be established first, as the
logic can then be utilized to its maximum potential; the reverse order means that, in applications
such as SLAM, there will always be underutilized logic.
Moreover, an ideal architecture would share caches between accelerators and general-purpose
compute, with some coherency protocol implemented in hardware. This would provide an
order-of-magnitude improvement in tasks such as synchronisation and co-processing latency,
as well as in raw bandwidth for accessing key shared data structures. It would also allow
compute capabilities, and therefore the frame-rate, resolution and density characteristics, to
scale up better compared to current system architectures in off-the-shelf FPGA-SoCs.
Deploying these architectures on devices with higher resource constraints
The presented research has been implemented and tested targeting a 5-10 W power
consumption. This was done for two reasons. Firstly, most other mobile platforms that have
attempted advanced SLAM had a similar or higher power consumption and, as has been
demonstrated, significantly less performance. Secondly, it was imposed by the development
boards that were available to conduct experiments. As it stands, the FPGA-SoCs we prototyped
on could be used directly, with minimal modification, on a real robotic system.
However, with its weight and power consumption, the system tested in our experiments would
be limited to larger ground robots. A smaller off-the-shelf board, with a reduction in
performance or resolution, could be used on larger, more powerful quadcopters. However, a
significant improvement could be obtained by integrating the FPGA-SoC into the system board
of the robot. This would reduce the power losses and the large weight increase brought by the
use of development boards. At that point, it would already be an efficient solution for large
robots or vehicles, aiming to accelerate or offload state-of-the-art SLAM capabilities.
However, the lessons learnt here and the general architecture can both be transferred to the
development of more power constrained systems. Firstly, the principles we have presented for
streaming operations, dataflow processing, deep pipelines and the importance of caching hold
for any type of SLAM system, especially as the resolution and density increases. For semi-dense
SLAM, our architecture could fit in a smaller more power efficient FPGA, directly integrated
with a state-of-the-art CPU on a custom SoC using the latest semiconductor process technology.
Such a custom SoC, specialised for the application, would achieve real-time performance while
improving power consumption significantly compared to off-the-shelf FPGA-SoCs that include
many unnecessary components on-chip. Further reducing parameters such as resolution and density would allow the use of smaller logic and a more power-efficient CPU, further improving power consumption and enabling deployment on most battery-powered devices. Meanwhile, re-implementing
part of the architecture in a lower-level HDL such as Verilog and taking advantage of vendor-
specific features present in state-of-the-art FPGAs can improve the achievable frequency of the
presented architectures after place and route, providing a further performance improvement.
Finally, for applications in micro-aerial vehicles or augmented reality, where the weight and
power of an FPGA could still prove too high, the principles and architectures presented here
could guide the design of a completely custom ASIC. Such a specialised chip could implement
a heterogeneous platform in the form of dedicated accelerator hardware tightly integrated next
to, or even as part of, a general-purpose CPU with dedicated I/O for the application. This kind
of solution would provide at least an order of magnitude improvement in power and area for
the same performance compared to an FPGA fabric at the cost of significantly higher design
effort.
Upcoming advances in hardware that stand to significantly benefit this architecture
We have already discussed in this section the most important changes and advances towards
significantly improving what is achievable with this type of architecture, both in the context
of FPGA-SoCs as well as custom chips. Since much of the focus is on memory bandwidth and latency, a concern shared with most high-performance digital hardware currently being developed, solutions from that space, if ported to FPGA-SoCs, will provide a significant improvement and allow an improved implementation of the designs and research directions discussed in this chapter.
The most important of these anticipated developments are 3D stacking of logic and High Bandwidth Memory (3D-stacked SDRAM with much wider, lower-latency and lower-energy communication buses). These stand to significantly improve memory bandwidth, as well as latency and
bandwidth in communication between different modules in a heterogeneous system. This type
of technology fits well with the application of the principles discussed in this chapter towards
higher performance and efficiency.
6.3 Research Conclusions
The first conclusion drawn from the work presented in this thesis is that the field of hardware design is entering an era where hardware-software co-design is increasingly crucial to achieving highly efficient and performant designs. One of the main bottlenecks in the
work presented in Chapter 3 comes from attempting to accelerate a task that was developed
using software engineering principles disconnected from the underlying hardware. Redesigning parts at the software level, both in the context of general-purpose hardware and, especially, in the presence of a custom hardware accelerator operating in the same memory space, can significantly improve the efficiency of many tasks without changing the principles or functionality of the underlying SLAM algorithm.
Focusing on SLAM, a main conclusion is that in this family of algorithms, data movement is a crucial aspect of the achievable performance and power efficiency of a custom hardware accelerator. SLAM is data-intensive and, for current digital hardware, off-chip data movement is one of the most power-consuming and high-latency operations. In this context, SLAM ideally requires a combination of specially-designed caches or local memories for the information accessed in a random manner, and dedicated high-bandwidth streaming direct memory access for the rest of the communication.
Since locality is weak and the access patterns of the random accesses are not predictable, traditional cache design becomes less efficient. Instead, a better solution in terms of performance and efficiency was found by recognising that this randomly accessed information consists only of pixel intensities, and is thus smaller than the rest of the processed data, which is accessed in a more predictable manner. The solution is to design partitioned, multi-port memories specifically for this information, supporting single-cycle access to square windows of neighbouring pixels, a very common access pattern for both tracking and mapping in SLAM. This includes on-the-fly gradient calculation, finding the minimum, maximum and absolute values in a small area, and interpolating pixel intensities around a point in the camera's frame. Finally, recognising that the rest of the information is usually accessed on a row-by-row basis, dedicated I/O units with buffered burst transactions become a natural choice for efficient high-performance designs.
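To make the window-access idea concrete, the following C++ sketch is a behavioural model, not the thesis RTL, of a frame cache partitioned into column banks; combined with the dual-port block RAMs found on FPGAs, the four pixels of a 2x2 window can then be fetched in a single cycle, here used for bilinear interpolation. All names and sizes (`PartitionedFrameCache`, `BANKS`, the toy 8x8 frame) are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int BANKS = 2;     // two column banks; a toy partitioning choice
constexpr int W = 8, H = 8;  // toy frame size

// Frame cache partitioned by column parity: bank[b] holds the columns with
// (x % BANKS) == b. A 2x2 window touches each bank at two addresses, which
// dual-port block RAMs can serve in one cycle.
struct PartitionedFrameCache {
    std::vector<std::vector<uint8_t>> bank;
    PartitionedFrameCache() : bank(BANKS, std::vector<uint8_t>(W / BANKS * H, 0)) {}

    void write(int x, int y, uint8_t v) { bank[x % BANKS][y * (W / BANKS) + x / BANKS] = v; }
    uint8_t read(int x, int y) const { return bank[x % BANKS][y * (W / BANKS) + x / BANKS]; }

    // In hardware this is one parallel access across banks and ports.
    void window2x2(int x, int y, uint8_t out[4]) const {
        out[0] = read(x, y);     out[1] = read(x + 1, y);
        out[2] = read(x, y + 1); out[3] = read(x + 1, y + 1);
    }
};

// Bilinear interpolation of intensity at a sub-pixel point: one of the
// window operations mentioned in the text.
float interpolate(const PartitionedFrameCache& c, float fx, float fy) {
    int x = static_cast<int>(fx), y = static_cast<int>(fy);
    float ax = fx - x, ay = fy - y;
    uint8_t w[4];
    c.window2x2(x, y, w);
    return (1 - ax) * (1 - ay) * w[0] + ax * (1 - ay) * w[1]
         + (1 - ax) * ay * w[2] + ax * ay * w[3];
}
```

Larger windows, such as those needed for gradient or min/max operations, would call for more banks or an additional row-wise partitioning, at the cost of more block RAM.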
This is also important for the next main conclusion of this work. Because SLAM combines a high number of operations per keypoint with non-trivial control flow and variable latency for some tasks, an architecture following the dataflow paradigm, as described here, produces a very efficient hardware design. Separating computation into well-optimised, self-contained hardware blocks, connected to each other with streaming buffers, each with a clear input and output and its own internal control, results in a better fit for SLAM. Building on this, one can design multi-modal, multi-rate and variable-latency hardware units, utilising the buffered communication to absorb delays and thus matching the variable-latency tasks inherent in semi-dense SLAM. This offers high performance together with high utilisation, and therefore power efficiency. Instead of treating the application's characteristics as challenges, they can be embraced as opportunities to improve the flexibility and efficiency of the generated custom hardware.
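The block structure described above can be sketched as a toy software model in C++: two self-contained stages, each with its own control, coupled only through a bounded stream. The FIFO depth, the per-stage "work" (a doubling and an increment) and all names are illustrative assumptions, not the thesis design.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Models an on-chip FIFO connecting two hardware blocks.
template <typename T>
struct Stream {
    std::queue<T> q;
    std::size_t depth;
    explicit Stream(std::size_t d) : depth(d) {}
    bool full() const { return q.size() >= depth; }
    bool empty() const { return q.empty(); }
    void push(T v) { q.push(v); }  // caller checks full() first
    T pop() { T v = q.front(); q.pop(); return v; }
};

// Each loop iteration is one "cycle": stage 1 advances whenever its output
// stream has room, stage 2 whenever its input stream has data. Neither stage
// knows the other's timing; the buffered stream absorbs rate mismatches.
std::vector<int> run_pipeline(const std::vector<int>& in, std::size_t fifo_depth) {
    Stream<int> s(fifo_depth);
    std::vector<int> out;
    std::size_t i = 0;
    while (i < in.size() || !s.empty()) {
        if (i < in.size() && !s.full()) s.push(in[i++] * 2);  // stage 1: produce
        if (!s.empty()) out.push_back(s.pop() + 1);           // stage 2: consume
    }
    return out;
}
```

For example, `run_pipeline({1, 2, 3}, 4)` yields `{3, 5, 7}`; in the hardware analogue each stage is a separate block and the stream is a FIFO of the chosen depth.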
Last but not least, a design utilising the dataflow paradigm and smaller, well-contained hardware blocks, instead of a monolithic pipeline, offers another advantage. Especially with high-level synthesis languages, but not limited to those, such a design allows the designer to optimise each unit separately and improves the design process, on both the conceptual and the tool level. Moreover, high-level synthesis enables faster and easier design-space exploration, by varying the compute-unit allocation for each hardware block and using the tools to search for the best scheduling of the necessary operations on the allocated hardware units.
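As a sketch of what such an exploration loop can look like, the C++ model below enumerates compute-unit allocations for two hardware blocks and keeps the fastest configuration within a resource budget. The cost and latency models (latency roughly ceil(work/units), with the slowest stage dominating a dataflow design) are deliberate simplifications, and all numbers are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <limits>

struct Block { int work; int units_cost; };  // ops to schedule, resource cost per unit

// Exhaustively search unit allocations for two blocks: per-block latency is
// ceil(work / units); in a dataflow pipeline throughput is set by the slowest
// stage; total resources must stay within the budget.
int explore(const std::array<Block, 2>& blocks, int budget,
            std::array<int, 2>& best_alloc) {
    int best = std::numeric_limits<int>::max();
    for (int u0 = 1; u0 <= 8; ++u0)
        for (int u1 = 1; u1 <= 8; ++u1) {
            int res = u0 * blocks[0].units_cost + u1 * blocks[1].units_cost;
            if (res > budget) continue;
            int lat0 = (blocks[0].work + u0 - 1) / u0;
            int lat1 = (blocks[1].work + u1 - 1) / u1;
            int lat = lat0 > lat1 ? lat0 : lat1;  // slowest stage dominates
            if (lat < best) { best = lat; best_alloc = {u0, u1}; }
        }
    return best;
}
```

A real HLS flow replaces these toy models with the tool's post-synthesis resource and scheduling reports, but the search structure stays the same.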
6.4 Future Work
Lastly, this section discusses ideas, opportunities and research directions for the architectures presented in this thesis and for the field of embedded high-performance, low-power SLAM in general, as well as how the conclusions discussed here can be applied in the future.
In the short-term, there are certain directions or enhancements that have not been explored yet
in the context of this research, but can provide different benefits to the presented architectures.
Firstly, scalability as discussed above was explored in the context of compute units, which for the targeted family of devices (Xilinx Zynq) was sufficient to target different off-the-shelf SoCs
and resource budgets. However, especially in the larger context of different FPGA-SoC devices
or ASICs, a separate investigation can take place for different designs to scale the off-chip
memory interfaces, the communication buffers between hardware blocks and the on-chip frame
caches. This will introduce more variables in the design-space exploration that should be able
to enhance the benefits of the scalability techniques we have already proposed.
In the realm of custom hardware, the performance and resource advantages of using reduced
precision and custom data types are well researched. Meanwhile, SLAM algorithms deal with
imprecise data and probabilistic estimates for the environment. Therefore, building on top of
the architectures presented in this thesis, an improvement can be realised by using a mixture of
normal and reduced precision compute units and custom numerical representations. By taking
advantage of domain-specific knowledge, a carefully designed implementation of the above can
further enhance the performance and efficiency of the proposed architectures, while limiting the
drop in accuracy or quality by tailoring them to the sensors and computation used in SLAM.
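As a minimal illustration of such a custom numerical representation, the C++ sketch below implements a 16-bit fixed-point type. The Q7.8 format and the operation set are assumptions chosen for illustration; an FPGA implementation would typically use a vendor type such as Xilinx's `ap_fixed` instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical Q7.8 fixed-point type: 1 sign bit, 7 integer bits and
// FRAC = 8 fractional bits. Rounding on conversion bounds the
// quantisation error per conversion by 2^-(FRAC+1).
constexpr int FRAC = 8;

struct Fix16 {
    int16_t v;
    static Fix16 from(float f) {
        return {static_cast<int16_t>(std::lround(f * (1 << FRAC)))};
    }
    float to_float() const { return static_cast<float>(v) / (1 << FRAC); }
};

inline Fix16 add(Fix16 a, Fix16 b) { return {static_cast<int16_t>(a.v + b.v)}; }

inline Fix16 mul(Fix16 a, Fix16 b) {
    // widen to 32 bits for the product, then shift back into Q7.8
    return {static_cast<int16_t>((static_cast<int32_t>(a.v) * b.v) >> FRAC)};
}
```

A mixture of such types, with word lengths tailored per compute unit to the sensor noise and the algorithm's tolerance, is what the paragraph above proposes; each narrower operator costs fewer DSPs and less routing than its floating-point counterpart.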
In the longer term, there are two promising research directions, in this author's view. The
first has to do with using the presented architectures and their derivatives as a base, to explore
the benefits of having state-of-the-art SLAM functionality self-contained on-board an embedded
platform. By combining existing lightweight robotic platforms, such as quadcopters, with the
proposed architecture realised on an FPGA-SoC, it becomes possible to enable autonomous
environment exploration in much smaller and more agile robots than before. Progress on this front can feed back into developments both in SLAM and in realising more advanced applications of this type, opening up exciting research avenues.
The second direction has to do with the evolution of SLAM to a more complete version of
environment awareness. New research directions combining SLAM with advances in machine learning and with ideas for multiple levels of map representation stand to provide leaps of improvement in the quality and level of environment understanding. For example, feeding the depth map of a SLAM algorithm, together with the camera pixels, into a deep convolutional neural network can produce a prediction of the object boundaries and labels for the different objects in the scene. This could then potentially be fed back into SLAM to further refine the generated depth map and detect outliers with greater precision.
By processing the highly demanding tasks of tracking and mapping on dedicated hardware
with the architectures described here, the general purpose CPU and potentially GPU of an
embedded system are free to be used for other levels of awareness such as scene labelling and
obstacle avoidance. Moreover, advances in reconfigurable hardware can be utilised to design
dedicated hardware for the tasks of machine learning as well, as they become more developed in
the research community, and combine these levels of understanding on a single custom hardware
unit. In this way, by breaking down the boundaries between these different algorithms, large improvements in efficiency and performance can be gained, with research into how to combine the different knowledge representations and how best to take advantage of data locality while image information streams through the same system-on-chip.