Single- and multi-FPGA Acceleration of Dense Stereo Vision
for Planetary Rovers
GEORGE LENTARIS, KONSTANTINOS MARAGOS, DIMITRIOS SOUDRIS, Department of
Electrical and Computer Engineering, National Technical University of Athens (NTUA), Greece
XENOPHON ZABULIS, MANOLIS LOURAKIS, Institute of Computer Science, Foundation for
Research and Technology – Hellas (FORTH), Greece
Increased mobile autonomy is a vital requisite for future planetary exploration rovers. Stereo vision is a
key enabling technology in this regard, as it can passively reconstruct in 3D the surroundings of a rover and
facilitate the selection of science targets and the planning of safe routes. Nonetheless, accurate dense stereo
algorithms are computationally demanding. When executed on the low-performance, radiation-hardened CPUs
typically installed on rovers, slow stereo processing severely limits the driving speed and hence the science that
can be conducted in situ. Aiming to decrease execution time while increasing the accuracy of stereo vision
embedded in future rovers, this paper proposes HW/SW co-design and acceleration on resource-constrained,
space-grade FPGAs. In a top-down approach, we develop a stereo algorithm based on the space sweep paradigm,
design its parallel HW architecture, implement it with VHDL and demonstrate feasible solutions even on
small-sized devices with our multi-FPGA partitioning methodology. To meet all cost, accuracy and speed
requirements set by the European Space Agency for this system, we customize our HW/SW co-processor by
design space exploration and testing on a Mars-like dataset. Implemented on Xilinx Virtex technology, or
European NG-MEDIUM devices, the FPGA kernel processes a 1120×1120 stereo pair in 1.7−3.1 sec, utilizing only 5.4−9.3K LUT6 and 200−312 RAMB18. The proposed system exhibits up to 32x speedup over desktop
CPUs, or 2810x over space-grade LEON3, and achieves a mean reconstruction error less than 2 cm up to 4 m
depth. Excluding errors exceeding 2 cm (which are less than 4% of the total), the mean error is under 8 mm.
Additional Key Words and Phrases: Planetary rovers, autonomous navigation, stereo vision, space sweep,
parallel architecture design, rad-hard FPGA, multi-FPGA partitioning
ACM Reference Format:
George Lentaris, Konstantinos Maragos, Dimitrios Soudris, Xenophon Zabulis, and Manolis Lourakis. 2019. Single- and multi-FPGA Acceleration of Dense Stereo Vision for Planetary Rovers. ACM Trans. Embedd. Comput. Syst. 18, 2, Article 16 (April 2019), 25 pages. https://doi.org/10.1145/3312743
1 INTRODUCTION
As confirmed by recent and planned activities, the exploration of planetary bodies, in particular
of Mars, is a priority for all major space agencies. Existing scenarios emphasize a high mobility
autonomous rover, which will perform both in situ exploration and sample-return missions to
further analyze the collected material back on Earth [Bajracharya et al. 2008; Ellery 2015]. Increasing
the navigational autonomy of planetary exploration rovers allows them to explore larger areas
Authors’ addresses: George Lentaris, Konstantinos Maragos, Dimitrios Soudris, Department of Electrical and Computer
Engineering, National Technical University of Athens (NTUA), Zografou, Athens, 15788, Greece, [email protected];
Xenophon Zabulis, Manolis Lourakis, Institute of Computer Science, Foundation for Research and Technology – Hellas
(FORTH), N. Plastira 100, Heraklion, 70013, Greece, [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
1539-9087/2019/4-ART16 $15.00
https://doi.org/10.1145/3312743
ACM Transactions on Embedded Computing Systems, Vol. 18, No. 2, Article 16. Publication date: April 2019.
16:2 Lentaris, G. et al.
during a mission and amass more scientific data. Embedded computer vision is a compelling
technology in this context, due to its reliance on passive cameras with reduced power consumption,
low cost, small size, increased mechanical reliability and longevity [Matthies et al. 2007]. Stereo
vision, in particular, permits a rover to reconstruct its environment in 3D so as to survey it, detect
obstacles or science targets and safely steer itself through unknown, potentially hazardous terrain.
In its archetypal form, binocular stereo involves two images acquired from slightly different
viewpoints and a computational procedure for determining a disparity map, i.e., an association
among matching pixels of the image pair [Scharstein and Szeliski 2002]. A triangulation calculation
determines the depth of a physical point from its disparity, i.e., the distance between its homologous
projections on the two images. Owing to the epipolar constraint which restricts corresponding
pixels to lie on conjugate epipolar lines, stereo matching amounts to a 1D search problem. Even so,
stereo matching is computationally intensive and requires a significant amount of time, especially
when it concerns images of high resolution.
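Concretely, for a rectified pair with focal length f (in pixels) and baseline b, the triangulation reduces to Z = f·b/d for disparity d. The sketch below is an illustrative Python model, with numbers close to, but not quoted from, the rover setup described later in this paper (the actual implementation is ANSI C/VHDL):

```python
def depth_from_disparity(d_pixels, f_pixels, baseline_m):
    """Depth of a scene point from its disparity in a rectified stereo pair."""
    if d_pixels <= 0:
        raise ValueError("disparity must be positive")
    return f_pixels * baseline_m / d_pixels

# Illustrative numbers: a 6.6 mm lens with 5.5 um pixels and a 20 cm
# baseline; the focal length in pixels is focal[m] / pixel_pitch[m].
f_px = 6.6e-3 / 5.5e-6                            # = 1200 pixels
z_far = depth_from_disparity(60.0, f_px, 0.20)    # ~4.0 m
z_step = z_far - depth_from_disparity(61.0, f_px, 0.20)  # depth change per disparity step
```

With these numbers, a one-disparity step near 4 m changes depth by roughly 6.6 cm, while the same step near 1 m changes it by only about 4 mm, which is why stereo accuracy degrades quadratically with distance.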
To withstand extreme temperature ranges and space radiation effects [Maurer et al. 2008],
planetary rovers have to utilize space-grade processors, which are 1−2 orders of magnitude slower
than equivalent terrestrial CPUs. The primary reason for this is that due to their high production
and certification costs, space-qualified electronics lag behind commercial technologies by several
fabrication process generations. For example, the MER (Spirit/Opportunity) and MSL (Curiosity) Mars missions [Ellery 2015] respectively use the RAD6000 @20MHz and RAD750 @133MHz rad-hard
CPUs, whose computing capabilities are similar to the PowerPC 601 and PowerPC 750 from
the 1990s. Both missions rely on stereo as a rover’s sole 3D sensing mechanism and schedule
stationary intervals within the rover traverse paths in order to perform the CPU-intensive stereo
processing [Bajracharya et al. 2008; Matthies et al. 2007]. Executing simultaneously with flight
software, a 256×256 map of 32 disparities requires about 30 sec to complete on MER rovers [Maimone
et al. 2007; Matthies et al. 2007]. As a result of slow processing, the rovers are controlled through a
prolonged command-execute-telemetry-deploy cycle, which scientists wish to shorten [Ellery 2015].
Looking ahead, increased-accuracy algorithms such as the one proposed in this paper will
increase the stereo workload by three orders of magnitude due to high-definition images, more
depths/disparities examined and more sophisticated matching metrics.
The aforementioned workload poses a great challenge to the task of improving a rover’s stereo
vision accuracy and execution speed for the eventual goal of advancing its autonomy. Practically,
this goal cannot be realized via rad-hard CPUs alone and mandates the use of hardware accelerators.
As demonstrated by our recent comparative analysis involving a large number of diverse processing
platforms as candidate accelerators for space avionics [Lentaris et al. 2018], Field Programmable Gate
Arrays (FPGAs) offer the highest performance per watt among all choices. Hence, they constitute
the most viable option for accelerating computationally demanding operations in space under a
limited power budget. High-density chips, such as Xilinx Virtex-5QV or Microsemi ProASIC3, are
already used in space and can serve as a reliable basis for future HW/SW co-design solutions in
rover navigation. Guided by the European Space Agency (ESA), in this paper we study, design, and
implement a HW/SW embedded system tailored to stereo vision on Martian rovers by utilizing a
rad-hard CPU and single or multiple rad-hard by design FPGAs of various technologies.
At platform level, we assume a generic rover setup [Kostavelis et al. 2014] with two navigation
stereo cameras mounted on a pan-tilt unit (PTU) atop a mast at a height of 1 m. The stereo cameras
are placed 20 cm apart (baseline distance), each with a resolution of 1120×1120 elements (5.5 µm
square pixels), 6.6 mm focal length and 50° field of view. They are parallel to each other, form an
angle of 39° with the horizon and capture a volume from 0.48 m to 4 m on level ground in front of
the rover. The PTU can pan the cameras in three directions (−35°, 0°, 35°) to acquire three stereo
pairs, jointly covering a field of 120° with 25% overlap. As per the requirements of ESA, such a
triple acquisition will take place with the rover being stationary and must complete within 20 sec,
including the time for stereo processing, reorienting the cameras and merging partial results.
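As a consistency check on these figures, the field of view follows directly from the sensor geometry: FOV = 2·atan(N·p/(2f)). A minimal Python sketch (ours, for illustration only):

```python
import math

def fov_deg(n_pixels, pixel_pitch_m, focal_m):
    """Full field of view along one image axis of a pinhole camera."""
    half_sensor = 0.5 * n_pixels * pixel_pitch_m
    return 2.0 * math.degrees(math.atan(half_sensor / focal_m))

fov = fov_deg(1120, 5.5e-6, 6.6e-3)   # ~50 degrees, as specified
span = 2 * 35.0 + fov                 # three pans at -35/0/+35 degrees
```

With the stated 5.5 µm pixels and 6.6 mm lens this gives approximately 50.0°, and three pans at −35°/0°/35° indeed jointly span 2·35° + 50° ≈ 120°.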
Besides the aforementioned 20 sec budget, the requirements set by ESA for the embedded system
also concern accuracy and HW cost. Specifically, the average reconstruction error for the cloud
of output points is required to be less than 2 cm up to 4 m depth (reflecting the fact that stereo is
inherently less accurate for distant targets). This accuracy is to be verified with a synthetic dataset
depicting a Mars-like environment with diffuse lighting, low contrast and a mixture of fine grained
sand, rock outcrops and surface rocks (cf. Fig. 2). In terms of HW cost, the solution should use a
rad-hard LEON processor with 150 MIPS processing power and FPGA(s) of space-grade technology;
the possibility of using one Xilinx Virtex-5QV (82K LUTs, 320 DSPs, 298 RAMB36) or multiple
smaller FPGAs of European origin (e.g., the NanoXplore NG-MEDIUM [Le Mauff 2018], with 34K
LUT4, 112 DSPs, 56 RAMB48) should also be examined.
The contributions of this paper are i) to develop and evaluate a space sweep stereo algorithm
customized to Martian scenarios, ii) to describe the first, to the best of our knowledge, FPGA
implementation of space sweep stereo in the literature, iii) to quantify the benefits of using FPGA
in embedded systems for future rovers by performing HW/SW co-design and developing parallel
HW architectures that achieve 1000x speed-up on space-grade chips and iv) to devise a multi-FPGA
partitioning methodology and demonstrate high-performance results even when employing small
FPGAs. Regarding our algorithmic and HW acceleration innovations, primarily we i) combine
sweeping planes at exponentially increasing distances from the cameras with the modified nor-
malized cross-correlation metric for improved depth accuracy, ii) combine deep-pipelining and
on-the-fly computation with highly-utilized HW reuse instead of trivial pixel/depth parallelization,
iii) rely on parametric VHDL and design space exploration for fine tuning, iv) explore and combine
low-level synchronization techniques for fast multi-FPGA communication. Consequently, the final
solution meets all ESA requirements by exploiting the trade-off between circuit size and execution
time. Instead of maximizing HW parallelization as often done in terrestrial applications with relaxed
HW constraints, our FPGA design employs only 5.4K LUTs to process a stereo pair with 2.5 Mpixels
in less than 3.1 sec, thus allowing a triplet of reconstructions and related tasks (PTU movement,
concatenation, etc) to be completed in the allotted 20 sec with limited HW resources.
Proceeding from the algorithmic to the implementation level, the remainder of the paper is
organized as follows. Section 2 reviews relevant published works. Section 3 presents the algorithmic
details and a SW evaluation on Mars-like datasets. Section 4 presents the proposed HW architecture
and the results of our FPGA implementation. Section 5 discusses our multi-FPGA partitioning
methodology and evaluates the results on space-grade devices. Section 6 concludes the paper.
2 RELATED WORK
2.1 Binocular Stereo Algorithms
According to the comprehensive taxonomy of [Scharstein and Szeliski 2002], binocular stereo
algorithms are comprised of a combination of the following steps: a) computation of matching cost,
b) aggregation of support, c) disparity computation and d) disparity refinement. Applied to pixel
windows, the matching cost measures the affinity of two images using criteria such as the sum of
squared/absolute differences (SSD/SAD), the normalized cross correlation (NCC) or the modified
NCC (MNCC) [Hirschmüller and Scharstein 2009; Moravec 1977]. Aggregation of support refers
to whether matching scores are combined locally or globally. Local methods aggregate cost by
summing within finite support regions. Thus, they emphasize the cost computation and aggregation
steps and often estimate disparities with a greedy “winner-take-all” strategy which identifies the
minimum cost disparity in each support region. On the other hand, global methods perform most
computations during the disparity computation step and typically lack any aggregation step. The
disparity computation step disambiguates potential matches by solving an optimization problem
under smoothness constraints. Preserving depth discontinuities is one of the major challenges
that needs to be overcome during disparity computation. Refinement of disparities concerns post-
processing operations such as sub-pixel interpolation, identification of occlusions, elimination
of spurious matches or filling of holes at textureless regions. The 3D scene structure is finally
recovered by triangulating matching viewing rays, with metric scale obtained from the knowledge
of the extrinsic calibration parameters [Lourakis and Zabulis 2013].
Compared to local ones, global stereo techniques generally produce more accurate results.
Nevertheless, the inaccuracies of local methods are mainly exhibited at depth discontinuities or
weakly textured regions and have a limited impact on the overall shape of the reconstructed surface.
As concerns their computational requirements, global methods need large amounts of memory
and have irregular data access patterns. On the other hand, local methods are less computationally
intensive and simpler to implement, and are therefore usually preferred for time-constrained
applications. Besides, their extensive data-level parallelism, smaller memory footprint and
restricted data access pattern make local stereo methods better suited for parallelization on special
computing platforms such as FPGAs and Digital Signal Processors (DSPs).
Plane sweep is a local stereo algorithm that performs multi-image stereo matching with arbitrary
relative camera configurations [Collins 1996]. Originally targeted at the reconstruction of sparse
features, it was later adapted to dense depth estimation [Yang and Pollefeys 2003]. It works by
sweeping a hypothetical plane through a scene and measuring the photoconsistency of input images
as they are backprojected onto the swept planes, without prior rectification. Image projections
of a scene point at a certain depth should be highly correlated when backprojected on the plane
swept through the particular depth corresponding to their pre-image (Fig. 1, left). Plane sweep is
conceptually simple and amenable to highly parallel hardware implementation that can achieve
superior performance, hence it has been often employed for efficiently reconstructing outdoor
environments [Pollefeys et al. 2008; Schöps et al. 2017]. The accuracy of plane sweep can be
improved by choosing the sweeping direction according to the orientation of dominant structures
in a scene, e.g. the ground plane [Gallup et al. 2007]. Its computational complexity can be directly
modulated with respect to the precision of the depth map, both regarding pixel resolution as well
as depth precision. In this manner, it is possible to dedicate less time for a coarser reconstruction of
a scene, and thus obtain an anytime algorithm. This behavior is not trivially feasible with other
local stereo approaches. Owing to the above reasons, plane sweep is the approach adopted in this
work for realizing planetary rover stereo vision.
2.2 FPGA Implementations of Stereo Algorithms
The literature provides several FPGA implementations of stereo vision algorithms. In a few cases,
automatic HLS was employed to increase productivity at the expense of throughput [Rupnow et al.
2011]; however, for HW efficiency purposes, the majority of works rely on manual VHDL/RTL
coding. Earlier approaches [Ambrosch and Kubinger 2010; Banz et al. 2010; Jin and Maruyama 2014;
Jin et al. 2010; Tomasi et al. 2012] generally concern images of VGA (i.e., 640 × 480) resolution or
lower, and process a few hundred frames per second (FPS) with limited disparity search ranges (e.g.,
230 FPS for 64 disparities in [Jin et al. 2010]). Later works such as [Greisen et al. 2011; Lentaris et al.
2012; Papadimitriou et al. 2013; Wang et al. 2015; Zicari et al. 2012] deal with megapixel images.
In particular, [Lentaris et al. 2012] accelerates an ordinary, fixed-window stereo algorithm on
a Xilinx Virtex6 FPGA as a proof-of-concept for use by future planetary rovers. This algorithm,
hereinafter referred to as gad, computes two disparity maps (used for left-right consistency cross-
checking) by using Gauss-weighted cost aggregation of absolute pixel differences in 7×7 windows to
examine 200 disparities in 1120×1120 stereo images; subpixel disparities are computed with parabola
fitting. Aside from the Gaussian weighting and the different image resolutions and disparity levels
considered, gad is very similar to the stereo algorithm on-board the MER rovers [Matthies et al.
2007]. gad’s architecture relies on pixel-level pipelining and extensive resource reuse of on-chip
memory to consume only 3K LUTs and 101 RAMB36 while processing an image in 2.3 sec. In [Wang
et al. 2015], Altera Stratix FPGAs are used to accelerate a stereo algorithm employing AD-Census
cost computation, cross-based cost aggregation and semi-global optimization. The final system
processes up to 1600×1200 pixel images with 128 disparity levels at 42 FPS. However, it requires up
to 222K ALUTs and 16.6 Mbit RAM. [Tomasi et al. 2012] utilize a Xilinx Virtex4 FPGA to rectify
and process stereo images of VGA resolution at 57 FPS. The implementation uses a fine pipelined
method and multiple user defined parameters, which result in a total of 58K LUTs, 131 DSPs, and
100 RAMB18. Disparity computation is accelerated in [Greisen et al. 2011] with 54K LUTs (or
100K LUTs for the entire stereo pipeline) on Altera Stratix3 and processes 1080p images with 256
disparities at 30 FPS. [Papadimitriou et al. 2013] accelerate stereo matching with 38K LUTs on
Xilinx Virtex5 and process 1920×1200 images with 64 disparities at 87 FPS. [Pérez-Patricio and
Aguilar-González 2015] implement an adaptive window algorithm on an Altera DE2 board and
achieve 76 FPS for 1280 × 1024 images and 15 disparities.
Regarding multi-FPGA implementations, the relevant literature is very limited. [Darabiha et al.
2006] partition a phase-based stereo algorithm on a board containing four Xilinx Virtex 2000E
FPGAs and obtain dense disparity maps at a speed of 30 FPS for 256×360 images. The work of [Choi
and Rutenbar 2016] concerns a custom Markov Random Field inference system for stereo matching,
which relies on sequential tree-reweighted message passing and is accelerated on a Convey HC-1
platform, i.e., four Xilinx Virtex-5 V5LX330 FPGAs. [Lee et al. 2013] utilize two stacked Synopsys
HAPS-64 multi-FPGA boards for accelerating a mobile ray tracing architecture partitioned to eight
Virtex-6 LX760 FPGA chips.
In summary, a review of the literature reveals that FPGA stereo implementations for high-
definition images with increased depth accuracy (e.g., 1 Mpixel and at least 100 disparities), achieve
high FPS at the cost of increased FPGA resources, i.e. 38−222K LUTs. HW cost of such magnitudes
would over-utilize the limited size chips targeted in the current paper. In contrast, our work
concerns embedded systems with limited-capacity, space-grade HW and must attain the right
trade-off between FPGA resource utilization and speed sufficiency. Thus, given our goal of maximal
HW efficiency (i.e. throughput over cost), our approach is to trade one order of magnitude FPGA
resources for lower FPS, while maintaining conformance to ESA’s speed and accuracy requirements.
3 THE PLANE SWEEP ALGORITHM
3.1 Design and Software Development
Details of the stereo algorithm we developed, hereafter denoted as psweep, are given in this section.
psweep uses a plane as its sweeping surface, all instances of which are parallel to each other. They
are parameterized by their distance from a reference center O, typically chosen as the “cyclopean
eye”, placed midway between the two camera centers. psweep operates by sampling the swept
plane into cells (pixels) and translating it through space along a sweeping direction. Typically,
the swept plane is oriented to be approximately parallel to the two image planes, but this is not
mandatory if a justifiable prior can be presumed [Gallup et al. 2007].
The swept plane creates a family of parallel planes, denoted Wi. At each posture i of the sweeping
plane, the images of the input stereo pair, I1 and I2, are backprojected upon plane Wi, forming
images C1i and C2i; for reference in Sec. 4, this operation is called image projection. By means of
this projection, each point of Wi associates two viewing rays that intersect upon it. Fig. 1 (left)
Fig. 1. Left: Plane sweep illustration. Images acquired by the two cameras (red, green) are backprojected
upon hypothetical sweeping planes extending from the cyclopean eye (gray). Points A and B lie on such a
plane and are respectively tangent and non-tangent upon a physical surface. Middle: sample depth map in
pseudo-color obtained with the psweep algorithm for the left stereo pair of Fig. 2. Right: Photoconsistency
patterns as a function of depth. Textured surfaces give rise to clear local photoconsistency maxima (green).
Surfaces that lack texture may yield high photoconsistency values, but do not give rise to local maxima (red).
shows two such pairs of associated rays. The corresponding pixels in C1i and C2i are expected to be
photoconsistent if the intersection point of the ray withWi is tangent to a physical surface, because
they image the same real point (i.e. point A in the figure). In contrast, the pixels corresponding to
rays that intersect at a non-tangent point, i.e. B, are not expected to be photoconsistent because
they image different physical points. This observation constitutes the reconstruction principle of
plane sweep and determines a 3D point at the distance that photoconsistency is maximized, for each
viewing ray from the cyclopean eye. Subsequent adaptations of this idea have used more efficient
point sampling on the sweeping plane [Gallup et al. 2008], considered non-planar shapes for the
sweeping surface [Pollefeys and Sinha 2004; Zabulis et al. 2006], or exploited prior knowledge on
the location and orientation of the imaged structures [Gallup et al. 2007].
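For planar sweeping surfaces, the backprojection used in the image projection step can be expressed with the standard plane-induced homography from multi-view geometry; we state this textbook identity for reference, as the paper itself does not write the warp out. For a plane with unit normal n at distance d_i from the reference camera (intrinsics K_1), and a second camera with intrinsics K_2 and relative pose (R, t):

```latex
% Plane-induced homography mapping reference-image pixels to the second
% image via the plane n^T X = d_i (the sign of the t n^T term depends on
% the plane and pose conventions adopted):
H_i \;=\; K_2 \left( R \,+\, \frac{t\, n^{\top}}{d_i} \right) K_1^{-1}
```

Warping the two input images through the homographies induced by each swept plane yields the backprojected images whose per-pixel photoconsistency the algorithm then evaluates.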
Most flavors of plane sweep assess photoconsistency using support windows (i.e., image patches)
centered at the points of interest in C1i and C2i. Grayscale intensities in two patches are compared
using metrics like SAD, SSD, NCC and MNCC [Moravec 1977]. NCC and MNCC are routinely preferred
as they can withstand gain and bias changes, hence do not require the employed cameras to be
radiometrically calibrated. For two patches P1 and P2, NCC is defined as cov(P1, P2)/√(σ²(P1) σ²(P2)),
and MNCC as 2 cov(P1, P2)/(σ²(P1) + σ²(P2)), or, in an extended form,

MNCC(P_1, P_2) = \frac{2\,\bigl(n \sum_k P_1^k P_2^k - \sum_k P_1^k \sum_l P_2^l\bigr)}
                      {n \sum_k (P_1^k)^2 - \bigl(\sum_k P_1^k\bigr)^2 + n \sum_l (P_2^l)^2 - \bigl(\sum_l P_2^l\bigr)^2},    (1)

where P_1^k and P_2^l denote pixels k and l from patches P1 and P2, both of which consist of n
pixels over which the sums run. The advantages of MNCC over standard NCC are that it is faster
to compute and that it tends to zero when there is a significant difference in variance between the
patches P1 and P2, while closely approximating NCC for equivariant patches.
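Eq. (1) can be evaluated from five patch sums (ΣP1, ΣP2, ΣP1², ΣP2², ΣP1P2), which is what makes it attractive for incremental computation. A reference model (illustrative Python; the paper's SW baseline is ANSI C, and the small eps guard is our addition to avoid division by zero on constant patches):

```python
def mncc(p1, p2, eps=1e-12):
    """Modified NCC of two equally sized grayscale patches, per Eq. (1)."""
    assert len(p1) == len(p2)
    n = len(p1)
    s1, s2 = sum(p1), sum(p2)
    s11 = sum(a * a for a in p1)
    s22 = sum(b * b for b in p2)
    s12 = sum(a * b for a, b in zip(p1, p2))
    num = 2.0 * (n * s12 - s1 * s2)
    den = n * s11 - s1 * s1 + n * s22 - s2 * s2
    return num / (den + eps)   # -> 0 when both patches are textureless

# Identical textured patches are perfectly photoconsistent:
score = mncc([1, 5, 3, 7], [1, 5, 3, 7])          # ~1.0
# A pure bias change leaves the score unaffected:
score_bias = mncc([1, 5, 3, 7], [1.1, 5.1, 3.1, 7.1])
```

Note how a pure bias change leaves the score at 1 for identical underlying texture, a large gain difference pulls it toward 0 (the variance-mismatch behavior noted above), and two constant (textureless) patches score 0 rather than a spurious 1.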
The photoconsistency values for each cell of plane Wi and depth d may be used to form a
“photoconsistency image” Mi, which is “stacked” in a 3D matrix M(x, y, d). To quantify photoconsistency,
psweep adopts the MNCC metric because it exhibits increased robustness to textureless image
regions while providing reasonable performance at a low computational cost. The MNCC is
computed in square image patches with an acceptance threshold of 0.85; this is referred to as
MNCC calculation in Sec. 4. Image patches for MNCC are 13 × 13 pixels, with similar results obtained
using slightly smaller or larger patch sizes. psweep’s implementation avoids recomputing intermediate
Fig. 2. Sample stereo pairs from the “rocky” (left) and “flat” (right) terrain types employed for evaluation.
The “flat” dataset contains weaker texture.
terms of the MNCC cost function, such as the sum of pixels within its support window or the
sum of their squares for each pixel. This is achieved by computing these intermediate quantities
through a convolution with a constant kernel of the same size as the integration window. Such a
convolution can be highly optimized on FPGAs. At each pixel, psweep computes depth using a
winner-take-all policy followed by interpolation via parabola fitting to improve resolution. The
combination of these two steps is referred to as map updating in Sec. 4. Fig. 1 (middle) shows the
depth map obtained with psweep for the left stereo pair of Fig. 2. 3D points are recovered from
their image coordinates and computed depths; this is called reconstruction in Sec. 4.
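The constant-kernel convolution that produces those per-pixel window sums is separable and can be realized with running sums, costing one add and one subtract per output regardless of the 13×13 window size, which is precisely what makes it FPGA-friendly. A 1-D illustration (our sketch, assuming the input is at least as long as the window):

```python
def sliding_sum(vals, w):
    """Convolution with a constant kernel of width w via a running sum
    (one add + one subtract per output), shown in 1-D; a 2-D box filter
    applies this along rows and then along columns."""
    out = []
    s = sum(vals[:w])          # first window computed directly
    out.append(s)
    for i in range(w, len(vals)):
        s += vals[i] - vals[i - w]   # slide the window by one pixel
        out.append(s)
    return out

sums = sliding_sum([1, 2, 3, 4, 5], 3)   # [6, 9, 12]
```

Applying `sliding_sum` along rows and then along columns of an image yields the 2-D box sums needed for every term of Eq. (1).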
A crucial difference of psweep from ordinary plane sweep variants is that it requires that the
optimum giving rise to a reconstructed point is local. That is, the photoconsistency values at
both the preceding and succeeding distances should be suboptimal [Vogiatzis et al. 2007]. The
reason is that textureless regions yield spuriously high maximal photoconsistency values and
produce a photoconsistency pattern that converges asymptotically to the maximum possible
photoconsistency value (see Fig. 1, right). As in [Collins 1996], the entire volume of photoconsistency
data is not retained in memory. As the sweeping plane proceeds, a 2D buffer marks for each pixel
the posture that has yielded the best score up to that instant; when sweeping concludes, it contains
the estimated depth for each pixel. To facilitate the determination of whether a particular value
is a local optimum, this buffer has three cyclically updated layers which store the preceding and
succeeding photoconsistency values.
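Per pixel, the three layers can be modeled as the (previous, current, next) photoconsistency samples: a depth is committed only when the middle sample is a strict local maximum above the acceptance threshold, and the same three samples feed the parabola-fit refinement. An illustrative Python model (names and the uniform-step refinement are our simplifications; psweep streams planes rather than storing whole score profiles):

```python
def best_local_peak(scores, depths, thresh=0.85):
    """Winner-take-all over one pixel's photoconsistency profile,
    accepting only strict local maxima (cf. Fig. 1, right), with
    sub-plane refinement by parabola fitting."""
    best = None
    for i in range(1, len(scores) - 1):
        prev, cur, nxt = scores[i - 1], scores[i], scores[i + 1]
        if cur >= thresh and cur > prev and cur > nxt:
            if best is None or cur > scores[best]:
                best = i
    if best is None:
        return None                      # hole: no reliable depth
    prev, cur, nxt = scores[best - 1], scores[best], scores[best + 1]
    # Parabola through the three samples; offset lies in (-0.5, 0.5) planes.
    offset = 0.5 * (prev - nxt) / (prev - 2.0 * cur + nxt)
    step = depths[best + 1] - depths[best]    # local plane spacing
    return depths[best] + offset * step

d = best_local_peak([0.2, 0.9, 0.95, 0.9, 0.99], [1.0, 1.2, 1.4, 1.6, 1.8])
```

A monotonically rising profile, as produced by textureless regions (Fig. 1, right), never yields a strict interior maximum and is therefore rejected as a hole.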
For simplicity, the spacing of the sweeping planes could be kept uniform. In the current work, a
more efficient parameterization is realized by choosing these planes to be exponentially sparser
with distance [Pollefeys et al. 2008], accounting for image discretization (see Fig. 1, left).
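One common way to realize such spacing, which we sketch here as an assumption rather than the paper's exact parameterization, is a geometric progression of plane distances between the near and far limits of the observed volume:

```python
def sweep_depths(z_near, z_far, n_planes):
    """Plane distances growing exponentially with depth, so planes are
    dense near the cameras and sparse far away (cf. Sec. 3.1)."""
    ratio = (z_far / z_near) ** (1.0 / (n_planes - 1))
    return [z_near * ratio ** i for i in range(n_planes)]

planes = sweep_depths(0.48, 4.0, 64)   # covers the rover's 0.48-4 m volume
```

With 64 planes over 0.48–4 m the spacing grows from roughly 1.6 cm near the rover to about 13 cm at the far limit, mirroring the quadratic loss of disparity resolution with depth.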
3.2 Dataset, Metrics and Software Evaluation Setup
To evaluate the accuracy performance of stereo reconstruction, a binocular dataset with known
depth ground truth was synthetically generated. More specifically, a total of 68 stereo images
corresponding to various locations along two simulated rover trajectories were graphically rendered
and their depth maps retained. These images are of size 1120 × 1120 pixels and represent two
different terrain typologies (namely “rocky” and “flat”) in a Mars-like environment; sample frames
are shown in Fig. 2. To gauge stereo performance, we define the following metrics:
M1: Root mean square (RMS) error for reconstructed and ground truth depths whose difference is
at most 2 cm. This is the primary metric, intended to measure the stereo accuracy excluding
large errors.
Table 1. Accuracy metrics for the “rocky” and “flat” Mars-like datasets.
                                   rocky            flat
Metric                         psweep    gad    psweep    gad
RMS (M1, mm) 7.50 5.33 7.86 4.95
Large errs fraction (M2.a, %) 2.54 4.60 3.94 3.25
Large errs RMS (M2.b, mm) 102 1065 124 1065
Coverage ratio (M3, %) 98.57 98.25 84.44 98.33
Mean hole size (M4, pix) 87.96 371.41 101.99 174.59
Consolidated RMS (mm) 17.87 228.53 25.88 192.12
M2.a: Fraction of points for which the depth error is more than 2 cm. This metric is meant to
measure the frequency of points with large errors. Metric M2.b corresponds to the RMS error
for such erroneous points.
M3: Coverage fraction measuring the ratio of points that are reconstructed over those that could
be potentially reconstructed. This metric depends on scene texture and certain algorithmic
parameters, e.g. the MNCC threshold.
M4: Mean hole size defined as the ratio of not reconstructed (i.e. hole) pixels over the number of
connected components corresponding to hole pixels. This metric indicates the average size
of image patches for which no depth measurement is available.
Quality metrics similar to M1 and M2.a above are used in [Scharstein and Szeliski 2002], albeit
defined on the disparities rather than the depths.
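For concreteness, M1–M3 can be computed directly from corresponding estimated and ground-truth depth maps; the following reference model (illustrative Python; hole pixels encoded as None, a representation we chose for clarity) mirrors the definitions above:

```python
import math

def stereo_metrics(est, gt, tol=0.02):
    """M1: RMS of errors <= tol; M2a: fraction of errors > tol;
    M2b: RMS of those large errors; M3: coverage of valid pixels.
    `est` entries are None where no depth was reconstructed."""
    small, large, holes = [], [], 0
    for e, g in zip(est, gt):
        if e is None:
            holes += 1
            continue
        err = abs(e - g)
        (small if err <= tol else large).append(err)

    def rms(v):
        return math.sqrt(sum(x * x for x in v) / len(v)) if v else 0.0

    n_rec = len(small) + len(large)
    return {"M1": rms(small),
            "M2a": len(large) / n_rec if n_rec else 0.0,
            "M2b": rms(large),
            "M3": n_rec / len(gt)}

m = stereo_metrics([1.00, 1.01, None, 1.50], [1.00, 1.00, 1.20, 1.40])
```

M4 additionally requires connected-component labeling of the hole mask, which we omit here.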
3.3 Performance Results
First, psweep was implemented and substantially optimized in ANSI C. The depths obtained from
this purely SW implementation were used as a baseline against which potential performance
degradations due to simplifications in the FPGA implementation were compared. Furthermore,
to analyze accuracy, the estimated depths as well as those computed with 13 × 13 masks by the
implementation of gad from [Lentaris et al. 2012], were compared with the true depths using the
metrics presented in Section 3.2. For both stereo algorithms, Table 1 lists the values computed for the
various metrics from the “rocky” and “flat” datasets. It also includes the consolidated RMS error for
all reconstructed points, regardless of magnitude. The RMS error expressed by metric M1 is slightly
better for gad (around 5.0 mm versus 7.5 mm for psweep). This difference is nevertheless small and
can be attributed to the idiosyncrasies of the dataset marginally favoring one algorithm over the
other. M2.a, the fraction of points reconstructed with errors exceeding 2 cm is also comparable for
both algorithms and in all cases less than 5%. However, as can be seen from the third row of Table 1
which provides M2.b, the RMS error for points with errors larger than 2 cm, points reconstructed
with gad have errors which are one order of magnitude larger compared to the errors for the points
obtained with psweep. This is because psweep generates fewer spurious reconstructed points due
to its strategy of selecting local optima in photoconsistency (cf. Sec. 3.1 and Fig. 1, right). Thus,
psweep has better overall accuracy, as can be confirmed by its significantly lower consolidated
RMS. The coverage ratio metric M3 well exceeds 80% for both datasets. M4, the average area of
holes is around 100 pixels for psweep and is smaller than that for gad.
To facilitate a better understanding of psweep stereo reconstruction errors, Fig. 3 provides two
histogram plots. Specifically, for the points reconstructed from each dataset, these plots illustrate
their total number and mean reconstruction error against the distance of reconstructed points from
the cyclopean eye. As can be clearly seen in the left plot, the majority of reconstructed points are
located near the stereo system. This is expected since, compared to distant ones, closer areas
ACM Transactions on Embedded Computing Systems, Vol. 18, No. 2, Article 16. Publication date: April 2019.

are imaged in greater detail, i.e., with a larger pixel count. The right plot in Fig. 3 shows that
reconstruction errors are larger for points further away from the stereo system, demonstrating the
well-known fact that the stereo measurement error is proportional to the square of depth. Due to
the synthetic texture’s granularity not being sufficiently high, very close surfaces are rendered with
low spatial frequencies and hence give rise to errors in stereo matching. This explains why the
errors at the first few closest distances are larger than those at the immediately farther distances.
Both plots in Fig. 3 start at 1 m, which is the minimum distance psweep is applied at, and extend up
to 4 m according to ESA’s requirements. The mean error in the right plot remains less than 1.6 cm.
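The quadratic error growth can be made explicit from the rectified-stereo relation d = f b/Z used later in Sec. 3.3; the derivation below is a standard first-order argument, not taken verbatim from the paper:

```latex
Z = \frac{f\,b}{d}
\quad\Rightarrow\quad
\left|\frac{\partial Z}{\partial d}\right| = \frac{f\,b}{d^{2}} = \frac{Z^{2}}{f\,b}
\quad\Rightarrow\quad
\delta Z \approx \frac{Z^{2}}{f\,b}\,\delta d ,
```

so a fixed matching error δd in disparity translates to a depth error that grows with the square of the depth Z.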
We have also evaluated the performance of psweep and gad using version 2 of the Middlebury
stereo benchmark [Scharstein and Szeliski 2015]. Since the notion of stereoscopic disparity is central
to conventional stereo, the Middlebury benchmark employs disparities both to encode ground
truth and also to define performance metrics. Based on a set of test stereo images and their true
disparities, the benchmark involves three error metrics (nonocc, all, disc) defined as percentages of
erroneously matched pixels in certain image regions. The metrics are computed for every disparity
field estimated by a stereo algorithm being evaluated and used to rank this algorithm according
to each error metric and disparity field tested. The algorithm’s overall rank is then obtained from
the average of its ranks for each error metric and test image [Scharstein and Szeliski 2002]. We
note at this point that psweep does not compute disparities but rather 3D depths directly. Thus, in
order to facilitate the evaluation of psweep with the Middlebury benchmark, we used the following
post-processing procedure to compute disparities from its 3D reconstruction output. A pixel in
the reference image is associated with its reconstructed point in the cyclopean coordinate frame.
Owing to the images being rectified, the optical axes of all cameras are parallel and therefore,
reconstructed points have the same depth Z in the coordinate frames of all cameras. Then, the
disparity d of a pixel for which a 3D point with depth Z has been reconstructed with psweep is
computed with d = f b/Z , where f is the focal length and b the stereo baseline.
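This conversion is trivial to express in code; the sketch below uses illustrative values for f and b (the paper does not list its camera parameters here):

```python
def depth_to_disparity(Z, f, b):
    """Disparity d = f*b/Z (in pixels) for rectified cameras with focal
    length f (pixels), baseline b, and reconstructed depth Z (same
    length unit as b)."""
    if Z <= 0:
        raise ValueError("depth must be positive")
    return f * b / Z

# e.g., with f = 1000 px and b = 0.1 m, a point at Z = 2 m maps to
# d = 1000 * 0.1 / 2.0 = 50 px
```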
Using Middlebury’s evaluation protocol with the disparities computed by psweep and gad for
the test images, we computed the percentages of bad matching pixels listed in Table 2. When these
performance metrics were compared against those of 167 other disparity-based stereo algorithms
evaluated in the benchmark’s web page [Scharstein and Szeliski 2015], psweep ranked 38th (with a
score of 47.6, smaller is better) and gad 96th (scoring 89.9). Thus, psweep is in the top quartile (i.e.,
25%) of the algorithms and gad in the top 60%. We remark that several of the top performing methods
on the Middlebury benchmark either involve expensive cost aggregation strategies, e.g. [Yang et al.
2014; Yoon and Kweon 2006; Zhan et al. 2016] or employ global disparity computation, e.g. [Liu
et al. 2015; Mozerov and van de Weijer 2015; Yang et al. 2009], and therefore are unsuitable for
implementation on resource-constrained FPGAs.
The average percentages of bad matching pixels over all test images were 2.43% for psweep
and 6.42% for gad. With these average bad matching pixel percentages, psweep ranks first in
accuracy with respect to the 19 stereo methods compared using the Middlebury test images in
Table IV of [Ttofis et al. 2015], while gad comes second. To the best of our knowledge, the above
constitutes the first evaluation of a space sweep algorithm on the Middlebury benchmark reported
in the literature.
4 PSWEEP HARDWARE ARCHITECTURE DESIGN
To meet the time requirements of ESA and facilitate implementation on resource-constrained HW,
we accelerated psweep by applying our custom HW/SW co-design methodology [Lentaris et al.
2016]. The latter involves algorithmic profiling/analysis, HW/SW partitioning, HW architecture
design, parametric VHDL coding, system integration with CPU-FPGA communication, as well as
design space exploration, i.e. parameter tuning. Next, we describe all of our steps and their outcome.
Table 2. Percentages of bad matching pixels for psweep and gad on the Tsukuba, Venus, Teddy and Cones
test images from the Middlebury dataset. The rightmost column comprises the average of each row.
Algorithm   Tsukuba               Venus                 Teddy                  Cones                 Bad pixels
            nonocc  all   disc    nonocc  all   disc    nonocc  all    disc    nonocc  all   disc    average %
psweep      4.29    4.95  1.41    1.21    1.83  0.31    3.67    4.43   2.51    1.38    1.97  1.21    2.43
gad         6.21    7.52  4.66    3.31    4.11  2.34    9.80    10.82  6.77    7.04    8.37  6.04    6.42
Fig. 3. Number of points reconstructed with psweep (left) and their mean error (right) as a function of distance
from the cyclopean eye for the “rocky” and “flat” datasets.
Table 3. Profiling on Intel i5-4590 and LEON3@50MHz: 1 thread, 8-bit 1120x1120 image, kT depths (iterations).

psweep function     time/iteration   utilization %   data in   memory
                    (on i5-4590)     (on LEON3)      (Mbit)    (MB)
image projection    86 msec          47.5            20        224
MNCC calc.          94 msec          51.5            20×kT     –
map updating        1.7 msec         0.96            40×kT     –
reconstruction      0.07 msec        0.04            40        4
total (per image)   55 sec           4778 sec        20        228
4.1 Algorithm Profiling & HW/SW Co-Design
The co-design methodology begins with a detailed analysis of the algorithm, which combines
automatic profilers and manual examination to partition it effectively into HW and SW components. To
accurately measure the performance of the code on a space-representative CPU, we ported the entire
psweep to a soft-core LEON3 with RTEMS running on a Xilinx FPGA at 50 MHz. In addition to
execution time and memory footprint, we also assess the communication requirements of each func-
tion, the reuse of variables among functions, as well as their arithmetic requirements (fixed/floating
point operations, dynamic range, accuracy, etc). By also considering the capabilities/peculiarities of
space-representative FPGAs, e.g., Xilinx Virtex-5QV, we compare the above results to the exact
ESA specifications/budgets and we select the functions that must be accelerated on HW.
The profiling results are summarized in Table 3, which breaks psweep into four main functions
and analyzes time, I/O, and memory usage. More specifically, assuming a total of kT = 301 swept
planes and stereo images of resolution 1120×1120, Table 3 reports the execution time on Intel Core
i5-4590 (average msec per iteration, plus total time per image), the execution time on LEON3 (in
terms of utilization per function, plus total time per image), the I/O requirements per function (total
Mbits input from the previous function), and the memory footprint. In total, psweep consumes
an excessive amount of time on LEON3, requiring well over an hour per image pair. Projecting
Fig. 4. High-level HW architecture of psweep based on deep pipelining at pixel-level & on-the-fly processing
input images on the sweeping plane consumes almost half of psweep’s time. Even more expensive
is the computation of the 13 × 13-windowed MNCC function. Hence, it becomes clear that these
two functions must be accelerated on the FPGA. Map updating is almost 10x faster than projection
and MNCC. However, its increased I/O (40 Mbit per iteration) prohibits its separation from MNCC
as their communication would increase the amount of off-FPGA data transfers, thus stalling
the HW/SW co-processing. In contrast, the reconstruction transforming the final map to world
coordinates is executed only once, after the sweep, consumes little time (2 sec on LEON3) and
requires limited I/O (just the final map data). In terms of arithmetic precision, the first three
functions can operate with fixed-point arithmetic, whereas the last function relies on floating-point
transformations. Considering all the above, we decide to accelerate the entire psweep on HW
except for the final reconstruction, which will be executed on the CPU handling the I/O with
the FPGA. In this manner, we accelerate 99.9% of psweep computations and aim at a speedup
factor of around 1000x over space-grade LEON to meet ESA’s requirements. Owing to the memory
requirements exceeding the storage capacity of FPGAs, processing should proceed in a stream
mode, with pixels acted upon and results forwarded to the host progressively and with minimal
buffering. For CPU-FPGA communication, we rely on ordinary 100 Mbps Ethernet, which involves
a HW controller (CSMA/CD LAN IEEE 802.3) and a SW driver handling packets of 1500 bytes MTU.
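The SW driver's packetization of the streamed image data can be pictured as follows (our sketch; Ethernet frame headers and checksums are ignored):

```python
def packetize(payload: bytes, mtu: int = 1500):
    """Split raw image data into MTU-sized chunks for the 100 Mbps
    Ethernet link between CPU and FPGA (sketch; the last chunk may be
    shorter than the MTU)."""
    return [payload[i:i + mtu] for i in range(0, len(payload), mtu)]
```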
4.2 Proposed HW Architecture
The high-level architecture of the proposed HW engine is depicted in the block diagram of Fig. 4.
Overall, our engine consists of a control unit and five components, which collectively form a very
deep pipeline starting at the “image memory” and terminating at the “depth map” memory. The
pipeline sustains an internal throughput of one MNCC value per cycle, i.e., it updates one depth
map value at every cycle in the course of examining each new depth plane (cf. section 3.1). To
sustain such throughput, the pipeline operates on a pixel basis and integrates dozens of processing
stages (70 − 84 depending on the design parameters, e.g., the MNCC size), which are fine-tuned at
register-transfer level to allow for a high clock frequency. As a result, the proposed architecture
achieves an increased level of parallelism (effectively, dozens of pixels are processed in parallel
within our deep pipeline), which is further increased by handling the left and right images of the
stereo pair in parallel. Moreover, aiming to minimize the on-chip memory utilization, our design
performs on-the-fly processing. In contrast to conventional SW approaches, our architecture avoids
storing any intermediate values/results other than the input pixels and the depth outputs, i.e., with
almost 100% HW utilization, the pipeline transforms the information as it flows from stage to stage
without employing temporary buffers, apart from a limited-size RAM needed by MNCC.
The central control unit initializes all components (loads the image memory, resets the registers,
etc.) and executes the main loop of psweep: at each iteration k , it assumes a new hypothetical depth
plane Dk and commands the processing modules to evaluate the current hypothesis for all pixels of
the Wi × Hi image. More specifically, assuming a sweeping plane of size Wp × Hp at depth Dk, the
“address generator” scans the stored image to facilitate the projection of each pixel from Wi × Hi
to a specific location on the Wp × Hp plane Dk. The image scanning is performed so that plane Dk
is gradually filled with one pixel per cycle in a raster-scan order. That is, depending on Dk,
the scan begins from a pre-determined location ⟨x0, y0⟩ in the image and continues according to a
pre-determined non-integer step (from left to right, top to bottom). These fixed values are stored in
a look-up table, e.g., a ROM, accessed at the beginning of each iteration k. The “address generator”
computes one pair of ⟨xn, yn⟩ vectors per cycle, n ∈ [1, Wi · Hi], one referring to the left and one to
the right image. The ⟨xn, yn⟩ vector is translated within the “storage” module to a memory address
and a refinement vector. The memory address is used to fetch a square of 4 neighboring pixels from
the image, whereas the refinement vector is used to guide the “interpolation” module to weigh the
4 fetched pixels and generate 1 projected pixel on the depth plane Dk.
Two projected pixels per cycle (one from the right plane and one from the left) are forwarded to
the “similarity” module, which evaluates the photoconsistency of the left and right planes according
to the MNCC metric. It utilizes small internal buffers to collect multiple pixels per support window
and calculate one MNCC value per pixel location (in total, Wp × Hp values per Dk). The “updating”
module receives one MNCC per cycle and compares it to the best value computed so far for the
specific location on the plane. In case of a photoconsistency improvement, the depth map is updated.
Iteration k completes when the entire plane Dk has been scanned (in Wp × Hp cycles) and the
control unit proceeds to the next plane Dk+1 by re-commanding the five modules of the pipeline.
The architecture is implemented with fixed-point arithmetic and parametric VHDL, which
allows the granularity of the sweep, i.e., the number of main loop iterations kT, to be configured
at compile time. The same holds for the granularity of the sweeping plane, Wp, Hp, the image
size, Wi, Hi, the MNCC support window size, M, and the word-lengths of the internal datapaths
(for precision tuning). Moreover, to tackle the limited on-chip memory of FPGAs, we equip our
psweep engine with an image partitioning mechanism to decompose the input data as follows.
The CPU divides the image in B horizontal bands (stripes) of size Wi × Hb, with Hb = Hi/B,
which are processed successively by reusing the same FPGA resources: each band is downloaded
to the FPGA, processed in kT iterations, and the resulting depth map stripe is uploaded to the
CPU almost independently of the next band (in practice, the bands overlap at their borders to
facilitate the correct sliding of the MNCC mask between them). Band height, Hb, is also provided
as a VHDL parameter. Overall, our HW design combines deep pipelining, on-the-fly computation,
tight synchronization between modules to maximize data flow and HW utilization, parallel memory
organization and parallelization of arithmetic to sustain the pipeline throughput regardless of
local algorithmic requirements, minimization of on-chip buffers and data transfers in CPU-FPGA
communication. Changes of setup/parameters during the rover’s lifetime can be accommodated via
FPGA reconfiguration. The following paragraphs elaborate on the architectural details of the main
modules of HW psweep.
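The band decomposition described above can be sketched as follows; the exact overlap policy at band borders is our assumption (half the MNCC mask on each side), as the text only states that bands overlap:

```python
def band_rows(Hi, B, M):
    """Row ranges [start, end) of B horizontal bands of an Hi-row image,
    each nominal band of height Hb = Hi/B extended by the half-width of
    the MxM MNCC mask so the mask can slide across band borders."""
    Hb = Hi // B
    half = M // 2
    return [(max(0, i * Hb - half), min(Hi, (i + 1) * Hb + half))
            for i in range(B)]
```

For Hi = 1120, B = 28 and M = 13, this yields 40-row nominal bands with 6-row margins at the internal borders.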
4.2.1 Image Storage and Projection. Almost one third of the on-chip memory in psweep is used
for storing the input stereo pair, i.e., the pixels of a band (updated B times per image). To support the
aforementioned pipeline throughput, the “storage” module utilizes two 4-bank parallel organizations,
one storing the left and one storing the right image. The 4 banks {A,B,C,D} are interleaved in both
x and y directions of the image (i.e., ABABAB... for the first row, CDCDCD... for the second row,
etc.), so that any 4 neighboring pixels are mapped to distinct banks. Therefore, in a single cycle, we
can access any square quadruplet from any location of the image without memory conflicts. The
translation of a ⟨xn, yn⟩ pointer on the image is performed according to the mapping

    bank(xn, yn) = (⌊yn⌋ mod 2) · 2 + (⌊xn⌋ mod 2)
    addr(xn, yn) = (⌊yn⌋ div 2) · (Wi/2) + (⌊xn⌋ div 2),          (2)

where ⌊x⌋ ≡ floor(x) denotes the integer part of x ≥ 0.
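The mapping of eq. (2) can be checked directly in software; the sketch below verifies the conflict-free property, namely that any 2×2 pixel quadruplet touches all four banks exactly once:

```python
def bank(x, y):
    # bank index of pixel (x, y) under the ABAB/CDCD interleaving
    return (y % 2) * 2 + (x % 2)

def addr(x, y, Wi):
    # address of pixel (x, y) inside its bank, for an image of width Wi
    return (y // 2) * (Wi // 2) + (x // 2)

def conflict_free(x, y):
    # the four neighbours of any bilinear fetch hit four distinct banks
    quad = [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]
    return sorted(bank(px, py) for px, py in quad) == [0, 1, 2, 3]
```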
With the use of a 4-word barrel shifter, the calculated addr(xn, yn) is forwarded to bank(xn, yn).
In parallel, addr(xn+1, yn), addr(xn, yn+1) and addr(xn+1, yn+1) are also forwarded to their
corresponding banks. The key value bank(xn, yn) is delayed via shift-registers and routed to the
output of the 4 banks to align the fetched quadruplet via a second barrel shifter according to
the ⟨xn, yn⟩ request. While the integer part of ⟨xn, yn⟩ is used in eq. (2) for fetching the pixel
quadruplet, the fractional part of ⟨xn, yn⟩ is forwarded to the “interpolation” module. The fractional
part ⟨xfr_n, yfr_n⟩ forms a refinement vector, which guides a bi-linear interpolation to generate a new
pixel in-between the fetched square of pixels (practically, xfr_n and yfr_n act as weights). This
interpolant constitutes the value of a pixel being projected from the image to a specific location on
the sweeping plane Dk (the successive ⟨xn, yn⟩ requests of the “address generator” are set up so that
the projections cover the sweeping plane in integer raster-scan order). The fetching pipeline has 8
stages and the interpolation pipeline has 12 stages (MULT/ADD arithmetic). The two pipelines are
cascaded as shown in Fig. 4 to input one ⟨xn, yn⟩ request and output one interpolant per cycle.
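Functionally, the fetch-and-interpolate path computes a standard bilinear blend; an SW equivalent is sketched below (ours, in floating point rather than the fixed-point HW datapath):

```python
import math

def project_pixel(img, xn, yn):
    """SW equivalent of the storage + interpolation modules: fetch the
    2x2 neighbourhood at (floor(xn), floor(yn)) from img[y][x] and blend
    it bilinearly with the fractional parts as weights."""
    x0, y0 = math.floor(xn), math.floor(yn)
    xf, yf = xn - x0, yn - y0            # refinement vector
    p00, p01 = img[y0][x0], img[y0][x0 + 1]
    p10, p11 = img[y0 + 1][x0], img[y0 + 1][x0 + 1]
    top = p00 * (1 - xf) + p01 * xf      # blend along x, top row
    bot = p10 * (1 - xf) + p11 * xf      # blend along x, bottom row
    return top * (1 - yf) + bot * yf     # blend along y
```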
4.2.2 Similarity Metric Calculation. The “similarity” module correlates the left and right planes
Dk on-the-fly, while they are output from the “interpolation” module in a pixel-by-pixel fashion and
in raster-scan order. The module computes 5 distinct sums per cycle and combines them according to
eq. (1). The challenge is to perform all summations by accessing each pixel once (and not repeatedly
for multiple overlapping support windows), i.e., without storing the left and right planes Dk . To
achieve this optimization, we propose a parallel architecture based on serial-to-parallel buffers and
partial sum calculators. Fig. 5 depicts the proposed architecture with 2-pixel input per cycle (one
from each plane) and 1 MNCC output. Each “buffer” consists of M FIFO RAMs, each one of depth
Wp, where M denotes the height of the M × M support window (e.g., M=13) and Wp the width of
the plane (e.g., Wp=1117). The FIFOs are serially connected to each other, such that when a new
pixel enters the first FIFO, all FIFOs update their outputs and operate as a sliding window on
the sweeping plane. Once every cycle, the window slides by one pixel in raster-scan order over
the plane, allowing a new column of M × 1 pixels to be read concurrently. The M × 1 column is
forwarded in parallel to five distinct components, which compute the five sums involved in eq. (1).
Fig. 5 depicts the structure of such a component assuming two M-word inputs, {wj} and {ej}, with
j ∈ [1, M]. Notice that in practice, depending on which sum of eq. (1) we need to compute, we
instantiate wj = 1 (for ΣPl), wj = ej (for ΣPl²), ej = wj (for ΣPr²), ej = 1 (for ΣPr), or
wj ≠ ej (for ΣPl · Pr), where Pl and Pr denote pixels from the left and right plane, respectively.
The M input pairs are routed to M distinct multipliers to derive the {wj · ej} products, which are
added by a tree structure to form a partial sum of the corresponding term used in eq. (1). In M
consecutive cycles, M partial sums are accumulated to calculate one complete term of eq. (1). We
note, however, that before feeding the aforementioned accumulator, we subtract the partial sum
computed M cycles in the past from the current value, because we must hold exactly M partial
sums in our accumulator in the course of scanning the plane and avoid indefinite increases. For this
reason, each new partial sum is also delayed by a depth-M shift-register. In parallel, we feed the
five calculated terms (i.e., complete sums) to the “fraction” component to generate the numerator
and denominator of eq. (1) (with adders/subtractors, squaring units, and flip-flops serving as local
synchronizers) and feed a pipelined fixed-point divider generating the MNCC output. Cascading
Fig. 5. Parallel architecture and HW pipelines of MNCC “similarity” module.
the above components forms a deep pipeline, consisting for example of 47 total stages for M=13
and 13-bit accuracy (23 stages in the “term calculator”, 15 in the “divider” and 9 for the rest), to sustain a
throughput of one MNCC result per cycle.
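A functional (non-pipelined) model of this one-pass scheme is sketched below. Note that eq. (1) is not reproduced in this excerpt, so the final combination of the five sums assumes the common "modified NCC" form 2·cov/(var_l + var_r); only the running add-new/subtract-old bookkeeping mirrors the HW:

```python
def sliding_mncc(L, R, M):
    """Functional model of the 'similarity' module: one pass over two
    equal-size planes L, R (2D lists), keeping the five window sums up
    to date by adding the newest Mx1 column sums and subtracting those
    that fell M columns behind. Returns {(y, x) of window top-left: mncc}."""
    H, W, N = len(L), len(L[0]), M * M
    terms = [lambda l, r: l, lambda l, r: r,
             lambda l, r: l * l, lambda l, r: r * r, lambda l, r: l * r]
    out = {}
    for y in range(H - M + 1):
        # column sum of one term over the M rows starting at y
        col = lambda f, x: sum(f(L[y + j][x], R[y + j][x]) for j in range(M))
        sums = [0.0] * 5  # running window sums of the five terms
        for x in range(W):
            for t in range(5):
                sums[t] += col(terms[t], x)          # add newest column
                if x >= M:
                    sums[t] -= col(terms[t], x - M)  # drop oldest column
            if x >= M - 1:                           # window complete
                Sl, Sr, Sll, Srr, Slr = sums
                cov = Slr - Sl * Sr / N
                var = (Sll - Sl * Sl / N) + (Srr - Sr * Sr / N)
                out[(y, x - M + 1)] = 2 * cov / var if var else 0.0
    return out
```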
4.2.3 Depth Map Updating. Each MNCC result is compared to the best MNCC computed so far
for the current ⟨x, y⟩ location on the plane over all Dk iterations. For this purpose, the “updating”
module includes a RAM of depth Wp · Hp to store the minimum required information per ⟨x, y⟩
location. First, it stores the best MNCC value Vb and its corresponding depth Db, which are updated
based on the comparison to the new MNCC value Vk at Dk. The comparison is implemented via
a read-write loop around the RAM block. That is, we employ a dual-port memory and develop a
7-stage pipeline starting and ending at the RAM block to continuously a) read the RAM address
addr, b) compare the addr contents to the current Vk, c) conditionally update the addr contents with
Vb = Vk and Db = Dk, d) increase addr by 1. We note that the local counter addr is synchronized to
the input rate of the module (we raster-scan the entire Wp × Hp plane), whereas the writing addr is
a mere delay of the read addr (the pipelined loop includes multiple registers to synchronize the data
and assist the EDA tools in routing the circuit). Second, to facilitate the improvement of psweep’s
precision via depth interpolation, i.e., to fit a parabola around the best MNCC value Vb according
to the formula (Vb−1 − Vb+1) / (Vb−1 + Vb+1 − 2Vb), the module temporarily stores all the MNCC
values Vk of the current depth plane (each one in a distinct RAM address, which is updated at every
main loop iteration k unless a Db is discovered). More specifically, when a new value Vk for ⟨x, y⟩
enters the module, a designated stage of our pipeline examines whether the previous value Dk−1
was selected as Db for that specific ⟨x, y⟩. In this case, the information currently flowing in the
module includes Vb (the stored Vk−1), Vb+1 (the current input Vk), and Vb−1 (the value Vk−2, which
was not updated at the previous iteration k − 1, because a Db was discovered). The three values
⟨Vb−1, Vb, Vb+1⟩ originating from the module’s I/O ports and internal RAM are synchronized and
routed to a secondary pipeline of f + 8 stages, where f denotes the fractional bits output from
depth interpolation (e.g., f = 20). This pipeline operates in parallel to the aforementioned 7-stage
primary pipeline and involves adders and a divider to compute the parabola fitting formula. The
secondary pipeline terminates at a distinct RAM of size Wp · Hp · f, which stores the fractional parts
of the depth map (the integer parts are indexed on the CPU via the Db results). Notice that our HW
implementation possibly computes the fractional part multiple times per ⟨x, y⟩ in the course
of psweep, as opposed to typical SW implementations, which perform only one such computation
at the end of all kT iterations. However, with the proposed scheme of performing interpolation
on-the-fly, the HW avoids storing all Vk values for all kT iterations, i.e., we store on-chip the
minimum number of Vk values (namely 2 per ⟨x, y⟩) and minimize the memory utilization without
stalling the computation as depth interpolation is performed in parallel to all other tasks. Upon
completion of all kT iterations, a local FSM forwards and clears the contents of the local 2-bank
RAM to the output of the HW psweep engine.
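In SW form, the refinement step amounts to the following (our sketch; the offset's sign convention and its scaling to physical depth are assumptions, as the text only quotes the parabola expression):

```python
def subplane_offset(v_prev, v_best, v_next):
    """Parabola-fit offset around the best MNCC value Vb, using the
    expression quoted in the text: (Vb-1 - Vb+1) / (Vb-1 + Vb+1 - 2*Vb).
    Returns 0 for a degenerate (flat) triple."""
    denom = v_prev + v_next - 2.0 * v_best
    if denom == 0.0:
        return 0.0
    return (v_prev - v_next) / denom
```

A symmetric pair of neighbours yields a zero offset, i.e., the peak sits exactly on the winning depth plane.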
4.3 Design Space Exploration & Fine-Tuning on FPGA
To assess the cost-speed-accuracy trade-offs in HW psweep and customize our engine for various
applications/devices (e.g., commercial Virtex-6, space-grade Virtex-5, European NG-MEDIUM), we
perform a design space exploration with the aid of our parametric VHDL code. The key parameters
affecting the sweep’s granularity, i.e., Wp, Hp, kT, were derived during algorithmic design to achieve
the accuracy results of section 3.3 and are retained here. For conciseness, we omit presenting a few
word-lengths in internal datapaths with minor impact on cost (e.g., the fractional bits in interpolation
affecting only a few ADD/MULT units, or the fractional bits of output depths). We select a representative set
of parameters, sufficient for our fine-tuning of psweep, and we implement numerous configurations
to evaluate the FPGA results. Specifically, we tune the following 4 parameters with trade-offs in:
• Image rows: related to the partitioning of the input image in bands, it determines the number
of image rows stored on-chip. Hence, it trades on-chip memory for comm/execution time.
• Plane rows: related to the partitioning of the output map in bands, i.e., the rows processed in
a single burst. It trades memory for execution time (decreases overhead of pipeline refilling).
• MNCC mask: the size M of the MNCC support window M×M. It trades psweep accuracy for
HW cost (M increases the number of MNCC buffers, the size of the MULT-ADD trees, and latency).
• MNCC bits: related to the fixed-point precision of MNCC, used for word-length optimization.
It trades psweep accuracy for HW resources (increases the bits involved in the calculations).
To illustrate the explored design space, we provide three distinct figures based on results obtained
with Xilinx ISE 14.7 for Virtex6 VLX240T: Fig. 6 depicts the FPGA resources for six configurations
with the parameter values shown on the x-axis as ⟨Image rows - Plane rows - MNCC mask - MNCC bits⟩ quadruples. Fig. 7 reports the total execution time of HW psweep with respect to band height, i.e.,
Plane rows, for various MNCC mask sizes, assuming a clock frequency of 172 MHz. Finally, Fig. 8
shows the accuracy of HW psweep by comparing the FPGA output to the SW results (used here
as reference, analyzed in Section 3.2) while varying the MNCC bits (i.e., the main source of HW
error, the other modules are ignored). For clarity, the y-axis of Fig. 8 shows in logarithmic scale
the number of errors, i.e., the values which are not equal to the SW results, expressed in distinct
custom units (e.g., in 500’s for depth values having more than 2 cm mean error).
Fig. 6. Resources of 6 psweep configurations denoted by ⟨Image rows - Plane rows -MNCC mask -MNCC bits⟩
Fig. 7. FPGA time per image versus band size (image = 1120x1120x2 pixels, clock = 172 MHz).
Fig. 8. psweep accuracy with respect to HW datapath
bits (differences versus SW implementation, for varying
MNCC bits) for 2.5 Mpixel stereo image (1120x1120x2).
In addition to providing a thorough evaluation of psweep, Figs. 6, 7, and 8 guide us to an
efficient balancing of parameters for the given application. Hereafter, we fix the MNCC to 13 bits,
because additional bits provide negligible improvement in accuracy at the cost of considerable
increase in RAMB/LUT, e.g., +6% if we use 18 bits; compared to the SW reference, 13 MNCC bits
lead to 99.9% of the depth results having error less than 0.25mm, only 0.01% having error more than
2 cm (i.e., 12 in a million), and 99.7% of the map holes remain intact (HW and SW implementations
reconstruct the same area in front of the rover). We note that MNCC bits have a minor effect on
speed. Instead, speed decreases considerably with the increase of the MNCC mask to Plane rows
ratio, which corresponds to the overhead of re-filling the deep pipeline and MNCC buffers each
results in approx. 33% more RAMBs with only 7% time gain. Notice that, among all FPGA resources
(LUTs, DFFs, RAMBs, DSPs), we pay particular attention to decreasing the memory due to the
limited RAM available on-chip in the FPGAs (especially in NG-MEDIUM, where the total amount is
only 2.7 Mbits, or 56 RAMB48s). Therefore, given that even the small bands respect our time budget
(except, e.g., 10 Plane rows with a 13 × 13 or 15 × 15 MNCC mask), we opt for 30 Plane rows. Overall,
we select the “44-30-13-13” configuration as the most suitable for meeting all ESA requirements
with sufficient safety margins, especially with respect to accuracy.
Table 4. HW cost analysis of the final psweep configuration.
FPGA resources on Virtex 6 (xc6vlx240t)
component            LUTs         DFFs         RAMB36s     DSPs
control unit         115 (1%)     51 (1%)      0 (0%)      0 (0%)
address generator    498 (1%)     427 (1%)     1 (1%)      0 (0%)
storage module       783 (1%)     812 (1%)     28 (6%)     0 (0%)
pixel interpolation  437 (1%)     594 (1%)     0 (0%)      8 (1%)
MNCC similarity      2,293 (1%)   2,481 (1%)   12 (2%)     45 (5%)
updating module      1,322 (1%)   1,382 (1%)   59 (14%)    0 (0%)
total                5,448 (3%)   5,747 (1%)   100 (24%)   53 (6%)
Table 5. HW comparison to relevant works in the literature (PDS=resolution·disparities·fps, eff =PDS/LUT).
publication [Lentaris et al. 2012] [Tomasi et al. 2012] [Wang et al. 2015] [Cocorullo et al. 2016] current work
cost (RAMB/LUT) 109 / 8.5K 99 / 58K ∼830 / 222K 32 / 70K 100 / 5.5K
speed (PDS) 109M 4505M 10472M 1253M 222M
HW efficiency 13K 78K 47K 18K 40K
4.4 FPGA Implementation Results & Comparison to Literature
The single-FPGA cost of the selected psweep configuration (specifically 44-30-13-13) is analyzed in
Table 4. The most demanding modules in terms of logic are “similarity” (due to the MNCC formula
complexity) and “updating” with depth interpolation, whereas the most memory demanding are
“storage” and “updating”. We note that while these utilization ratios appear small in the commercial
Virtex6, they however challenge the less capable space-grade FPGAs. The max clock frequency is
300 MHz on xc6vlx240t-2, where a 1120x1120 stereo pair of 301 depths completes in 1.7 sec.
Compared to existing works in the literature, the proposed psweep proves HW-efficient and very
low-cost (Table 5). For example, when gad [Lentaris et al. 2012] is implemented with a 13 × 13 mask
(for a fair comparison to the MNCC mask), the LUT+RAMB cost of psweep becomes smaller than gad’s
(8.5K LUTs, 109 RAMB36), with almost half the execution time (gad computes two depth maps for
bi-directional consistency checking) and better accuracy (cf. Sec. 3.3). Secondarily, gad utilizes zero
DSPs due to its less complex SAD-like similarity metric (for comparison purposes, the cost of psweep
implemented without DSPs is 12.6K LUTs). Moreover, at the algorithmic level, psweep is more
configurable compared to gad (has parametric granularity of depth and plane resolution). Compared
to the stereo core of [Wang et al. 2015], psweep provides the same HW efficiency (= throughput / logic)
when configured for fairness at 128 integer depths and 3.8 Mpix image resolution, which results
in approximately 1 FPS at 300 MHz with 41x fewer LUTs than [Wang et al. 2015] (and almost 5x
less on-chip memory). Similar conclusions for our HW efficiency are drawn when refining this
ratio to use the PDS throughput (the PDS metric combines resolution, depths, and FPS). psweep
exchanges ∼6x PDS for ∼12x fewer LUTs compared to [Cocorullo et al. 2016], whereas it utilizes
∼11x fewer LUTs and half the DSPs of [Tomasi et al. 2012] (the increased HW efficiency of
which comes at the expense of decreased reconstruction accuracy, as indicated by its performance
on the Middlebury images). This order of magnitude in LUT decrease emerges also versus the
Virtex6 implementation of [Jin and Maruyama 2014] (∼22x), the Kintex7 implementation of [Ttofis
et al. 2015] (∼18x), or even the Altera implementation of [Shan et al. 2014]. Therefore, given the
above comparisons, it becomes clear that psweep achieves state-of-the-art HW efficiency with the
lowest cost implementation (by one order of magnitude), which makes it more suitable for use on
resource-constrained, space-grade FPGAs (meets the goal/trade-off explained at the end of Sec. 2.2).
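As a sanity check on these figures, the PDS and HW-efficiency numbers can be re-derived from quantities stated in the text. The short sketch below (our illustration, not part of the paper's tool flow) reproduces the psweep column using the formulas of this section, PDS = pixels × depths × FPS and HW efficiency = PDS / LUTs:

```python
# Re-derivation of the psweep PDS and HW-efficiency figures (our arithmetic).
# Image size, depth count, execution time and LUT cost are taken from the text;
# the formulas are the ones stated in Sec. 4.4.

WIDTH, HEIGHT = 1120, 1120   # stereo pair resolution
DEPTHS = 301                 # depth hypotheses swept by psweep
EXEC_TIME = 1.7              # seconds per stereo pair on xc6vlx240t-2
LUTS = 5.5e3                 # psweep LUT cost (Table 5)

fps = 1.0 / EXEC_TIME
pds = WIDTH * HEIGHT * DEPTHS * fps   # pixel-depths per second
efficiency = pds / LUTS               # PDS per LUT

print(f"PDS        = {pds / 1e6:.0f} M")         # ~222 M, matching Table 5
print(f"efficiency = {efficiency / 1e3:.0f} K")  # ~40 K, matching Table 5
```

The same arithmetic applied to the other columns of Table 5 reproduces their HW-efficiency row as well.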
ACM Transactions on Embedded Computing Systems, Vol. 18, No. 2, Article 16. Publication date: April 2019.
16:18 Lentaris, G. et al.
[Fig. 9 flowchart boxes: Start; Partitioning Methodology; Synthesis & Implementation; “Proper operation?”; Synchronization Methodology; “Partition changes?”; End]
Fig. 9. Overview of proposed methodology for multi-FPGA implementation.
5 MULTI-FPGA IMPLEMENTATION
The last part of our work focuses on the implementation of psweep on actual space-grade technology
and, in particular, on providing working solutions with limited-size FPGA devices. We note that,
by exploiting the resource optimization of our previous section, we can now target rad-hard-by-design FPGAs, which support reliable embedded system development without additional mitigation
techniques, e.g., Triple Modular Redundancy. Despite the existence of a relatively big rad-hard
FPGA, namely the Xilinx Virtex-5QV, certain situations mandate the use of even smaller devices,
such as the Virtex-4QV (e.g., due to heritage issues), the Microsemi ProASIC3 (e.g., due to low-power
constraints), or the latest NanoXplore NG-MEDIUM (e.g., for reliance on European HW). However,
these devices provide a very limited amount of resources and cannot support the entire psweep,
especially when the embedded system/application is expected to include additional HW functions.
When considering limited-size devices, we propose the custom methodology for multi-FPGA im-
plementation shown in Fig. 9. It consists of two inter-dependent parts: the Partitioning, which refers
to the sequence of steps required to partition the design across multiple devices, and the Synchronization, which gradually establishes correct communication/synchronization among the devices.
5.1 Partitioning Methodology
Similar to HW/SW partitioning, the HW/HW partitioning involves analysis of the algorithm/design,
exploration of the HW platform, and careful decisions. We identify three steps:
(1) Analysis of the complexity per design component: perform fine-grain estimation of
resources (LUTs, RAMBs, DSPs and I/Os) for each component/function to facilitate informed
decision making throughout the entire process.
(2) Partitioning and Mapping: based on the analysis, we manually select the best mapping of
components to devices. The goal is to perform a balanced partitioning of the design in terms
of resources. Furthermore, we seek to minimize the number of interconnections (cut size)
between the partitions/devices while respecting pin constraints (e.g., number), in order to
simplify the integration of the system and decrease power dissipation (less I/O pin utilization).
(3) Trace assignment: we determine the interconnections between the devices, i.e., specific
pins and board traces, by considering the characteristics of the underlying platform (e.g.,
PCB architecture) and the available cables/connectors to support the connectivity of the
FPGAs. We manually seek to employ traces/cables with matching characteristics (e.g., length,
resistance) to decrease transfer delay variations in the network and avoid de-synchronization.
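Step (2) can be illustrated with a toy exhaustive search: map components to two devices so that each device's resource budget is respected while the cut size is minimized. All component names, resource figures, wire widths and budgets below are invented for the example; the actual partitioning in the paper was selected manually with the tool flow of Sec. 5.3:

```python
# Illustrative sketch of balanced 2-way partitioning under resource budgets
# with minimal cut size (bits crossing between devices). All numbers are
# hypothetical; this is not the authors' tool flow.

from itertools import combinations

components = {            # name: (LUTs, RAMB18s) -- made-up figures
    "storage":    (500, 109),
    "similarity": (3000, 40),
    "updating":   (2500, 60),
    "control":    (400, 10),
}
wires = {                 # (a, b): bus width in bits between components
    ("storage", "similarity"): 32,
    ("similarity", "updating"): 24,
    ("updating", "storage"): 16,
    ("control", "similarity"): 4,
    ("control", "updating"): 4,
}
BUDGET = (4000, 180)      # per-device (LUTs, RAMB18s) budget

def cut_size(group_a):
    """Total bits crossing between group_a and its complement."""
    return sum(w for (a, b), w in wires.items()
               if (a in group_a) != (b in group_a))

def fits(group):
    luts = sum(components[c][0] for c in group)
    rams = sum(components[c][1] for c in group)
    return luts <= BUDGET[0] and rams <= BUDGET[1]

names = list(components)
best = min((set(g) for r in range(1, len(names))
            for g in combinations(names, r)
            if fits(set(g)) and fits(set(names) - set(g))),
           key=cut_size)
print(sorted(best), "| cut =", cut_size(best), "bits")
```

For realistically sized designs the search space is too large for brute force, which is one reason the analysis of step (1) is needed to guide manual decisions.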
5.2 Synchronization Methodology
When targeting high-performance multi-FPGA implementations instead of mere ASIC prototyping,
synchronizing the high-speed signals among the devices becomes very challenging. Most often, the
default implementation of the EDA tools becomes non-functional at high clock rates. To overcome
problems such as bus skew, ringing effects, jitter, etc., we examine the factors that increase suscepti-
bility to de-synchronization and develop our proposed guidelines accordingly. Prior to enumerating
the methodology steps, we discuss various issues encountered during our implementation testing.
Distribution of the clock. The choice is between system-synchronous and source-synchronous
distribution. In a system-synchronous design, the clock is provided directly from the PLL oscillator
of the board to all devices simultaneously. In a source-synchronous design, the clock is provided by
the transmitting to the receiving device, i.e., the clock signal travels alongside the data suffering
from similar effects/delays on the board and, thus, limiting the propagation variability between
clock and data. Our experiments with both schemes favored the system-synchronous distribution,
as it proved more reliable on the employed multi-FPGA platform (Synopsys HAPS [Synopsys 2017]).
Digital clock managers. DCMs (also named MMCM) are used to implement various functions
such as delay locked loop (DLL), digital frequency synthesis (DFS) and digital phase shifting (DPS).
One of the main benefits exploited in this work was the de-skewing of the input clock (DLL) for
aligning the internal clock of the FPGA with the external (incoming) clock. Furthermore, DCMs
can be used to address the bus skew problem occurring when multiple data signals travel between
the devices (e.g., violation in setup times due to variation of the transfer delays); by shifting the
clock phase between the devices via the DCM, the rising-edge position can be adjusted so that the
arriving data signals can be sampled without timing violations. However, phase shifting is effective
only when the variation of delays is relatively small compared to the clock period.
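This limitation can be made concrete with back-of-the-envelope timing arithmetic (our illustration, not from the paper; the setup/hold figures are assumed, while the trace and cable delays are representative of those quoted in Sec. 5.3):

```python
# When can DCM phase shifting fix bus skew? Shifting the sampling edge helps
# only if the spread of trace delays plus the setup/hold window fits inside
# one clock period. Setup/hold values below are assumed for the example.

def phase_shift_works(period_ns, delay_min_ns, delay_max_ns,
                      t_setup_ns, t_hold_ns):
    """True if some sampling-edge position satisfies setup and hold
    for every trace delay in [delay_min, delay_max]."""
    # Data-valid window at the receiver, common to all traces:
    window = period_ns - (delay_max_ns - delay_min_ns)
    return window > t_setup_ns + t_hold_ns

CLK = 1000 / 224   # 224 MHz -> ~4.46 ns period

print(phase_shift_works(CLK, 1.2, 2.8, 0.5, 0.3))  # 1.6 ns spread: fixable
print(phase_shift_works(CLK, 1.2, 5.2, 0.5, 0.3))  # ~4 ns spread: not fixable
```

In the second case no phase-shift value works, which is why trace reassignment (below) must homogenize the delays before phase shifting can help.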
Trace reassignment. The communication among multiple FPGAs can be realized with various
types of board traces and/or support cables (I/O extension), which are described by different elec-
trical characteristics, such as wire resistance, length, etc. Combining different type of traces/cables
introduces de-synchronization. Instead, we must match the selected traces towards homogeneous
transfer delays of signals.
Packing of registers to I/O blocks. De-synchronization also occurs due to bus skew internally to
the FPGA, i.e., due to the diverse lengths of the interconnection nets (I/O nets). To overcome this
problem, we pack the registers in the I/O blocks. Assuming sufficient available registers in the
design, automatic I/O packing leads to nets of almost equal length and balanced propagation delays.
Insertion of extra I/O registers (delay lines). Provided that the extra registers do not violate the
cycle-level timing schedule of the digital design (i.e., do not de-synchronize it), this technique is
useful for two reasons. First, when the design utilizes only a limited number of registers, the extra
registers enable the I/O packing described above. Second, in case of general timing violations, the
extra registers in the I/O paths help the placement and routing (PAR) tools meet the plethora of
distinct requirements of each net.
Specification of the I/O Standard. The I/O Standard (e.g. LVCMOS, LVDCI, LVTTL, HSTL) defines
the electrical behavior of the input receivers and the output drivers of the FPGA. It determines
specific characteristics, such as the output drive voltage (e.g., 1.2V, 1.8V, 2.5V, 3.3V), the slew rate
(e.g., fast, slow) and the drive strength (e.g., 6, 8, or 12 mA). Carefully exploring this parameter and selecting
same standards among devices is a crucial customization with respect to the board’s capabilities.
Considering all the above, we devised the synchronization methodology depicted in Fig. 10. We
present a sequence of steps, each triggered when the previous ones fail to achieve synchronization.
The first and last steps/boxes essentially refer to the partitioning of Section 5.1 (connecting the two
methodologies of Fig. 9). The final step implies that all other techniques have failed and a refinement
of the partitioning is required, most probably at the expense of resource balancing. For such refinement,
we usually start from the previous partitioning and move the cut(s) of the computational
graph back and forth until, e.g., we decrease the cut size to obtain fewer traces and/or less traffic
between the nodes/devices that could not be synchronized (making their communication less demanding).
[Fig. 10 flowchart boxes: Partitioning Methodology (initial implementation); Examine I/O Standards, Board Traces, Clock Distribution; Insert DCMs; Customize Clock Phases; Pack Registers to IOBs; Insert Extra Registers; Customize Clock Phases; Explore I/O Standards, Board Traces, Clock Distribution; Partitioning Methodology (optimized, e.g., w.r.t. cut-size)]
Fig. 10. Proposed multi-FPGA synchronization methodology.
Overall, our proposed synchronization steps are summarized as follows (in Fig. 10, we highlight in
orange the steps giving rise to an increase of FPGA resources):
(1) Initial Partitioning: We begin with the solution provided by third-party tools (e.g., Synopsys),
which includes automatic timing constraints for realizing the communication among devices.
(2) Examine I/O Standard, Board Traces and Clock Distribution: Repetitively, at tool level, we
alternate among the most promising I/O signaling combinations depending on the capabilities
of the underlying multi-FPGA board.
(3) Insert DCMs: At HDL level, we place DCMs in all FPGAs to remove the clock delay/skew.
(4) Customize Clock Phases: At tool level, we perform clock phase shifting between the devices
via the DCMs, e.g., at 90°, 180°, or 270°, for the reasons explained above.
(5) Pack Registers to I/Os: At tool level, we select the packing of registers in the I/O blocks
aiming to internally minimize and balance the FPGA I/O net delays.
(6) Insert Extra Registers: At HDL level, by respecting the digital operation schedule (cycle-wise),
we insert registers at the I/O of partitioned components to facilitate further register packing.
(7) Explore I/O Standard, Board Traces and Clock Distribution: Similar to step 2, but performed
thoroughly to explore all possible combinations and board performances.
(8) Refine Partitioning: We manually derive a new partitioning towards minimizing the commu-
nication load between problematic devices/nodes (traces and/or transactions).
In conclusion, the best approach to multi-FPGA design was the “semi-automatic” one. When using
the default output of the “fully-automatic” tools, we usually derived non-functional implementations,
especially at high clock rates. With the “manual” approach, i.e., partitioning and assignment done
entirely by hand, the effort increased disproportionately and the design became prone to bugs (e.g.,
due to programming with codewords taken from datasheets). Instead, the “semi-automatic” approach
combines the best of both worlds, i.e., the GUI automation with human control/guidance.
5.3 Experimental Setup
To validate our methodology and demonstrate the relevant multi-FPGA solutions for psweep, we
set up a tool-flow, a HW platform, and a limited-size space-grade device. As a proof-of-concept,
we assume the latest NanoXplore NG-MEDIUM rad-hard device [Le Mauff 2018] and we use the
HAPS-54 prototyping platform [Synopsys 2017]. The European NG-MEDIUM device is built on
STM 65nm rad-hard technology and provides 34K LUT4, 32K DFF, 112 DSPs, with 56 RAMB48 (2.7
Mbits, not sufficient for psweep). The Synopsys HAPS-54 multi-FPGA board (Fig. 12, left) consists
of four Xilinx Virtex-5 LX330 FPGAs (built on 65nm technology like NG-MEDIUM, with 207K
LUT6, 207K DFF, 192 DSP, 288 RAMB36) and a number of predefined and flexible interconnections.
The limited number of on-board fabricated traces, e.g., 354 fast, 110 slow, or 238 global traces, have varying electrical characteristics and transmission delays, e.g., 1.2 ns or 2.8 ns. Hence, they
Fig. 11. Multi-FPGA implementation tool flow.
form a suitable testbed for evaluating our synchronization techniques. Also, we can employ various
external plug-in connectors/cables, e.g., con_2x1 or con_cable with ∼4ns delay. Overall, we emulate
the limited-size device challenge on HAPS-54 by constraining the VLX330 utilization according to
the NG-MEDIUM specifications.
Regarding the EDA tools, Fig. 11 illustrates the entire tool flow and the required inputs. The
inputs refer to device-specific files for the instantiation of various IPs (e.g., memories, DCMs),
source code files (VHDL, Verilog), constraint files for all the FPGAs (timing and placement) and
dedicated board files that define various parameters of the employed board (e.g., HAPS-54) such as
the reset and clock distribution. In the first stage of the tool chain, we use the Synopsys Certify
tool to perform the partitioning of the design and the trace assignment to interconnect the FPGAs.
In the second stage, we use the Synopsys Synplify tool to separately synthesize for each device
every partition derived from Certify and generate the corresponding netlist file. Moreover, in this
stage, we manually perform optional customizations in the design according to the aforementioned
methodology (insert registers, DCMs, etc.). Finally, Xilinx ISE is required to perform place & route
(PAR) for each FPGA and generate the corresponding bitstream. The necessary inputs are the
synthesized netlist and a file containing the placement/timing constraints retrieved from Synplify.
5.4 Implementation Results
To put things in perspective, psweep was first implemented on a single rad-hard FPGA, i.e., on
Xilinx Virtex-5QV (total size 82K LUTs, 320 DSPs, 298 RAMB36) with ISE 14.7. Our final psweep
engine fits in a single device together with a 100 Mbps Ethernet controller, which includes local
buffers for pipelining the FPGA processing and CPU communication steps instead of stalling our
HW engine. Specifically, psweep consumes 5.9K LUTs, 53 DSPs, and 108 RAMB36, whereas the
Ethernet arbiter utilizes 3K LUTs and 55 RAMB36. The maximum clock frequency reported by
Xilinx Timing Analyzer is 166 MHz, i.e., the 2.5 Mpixel stereo image can be processed in 3.1 sec.
Based on these results and extrapolations that also regard the European FPGAs [Le Mauff 2018],
we estimate that, in such space-grade chips, psweep will consume power in the range of 4–10 watts,
which is today considered an acceptable budget for rover applications.
Next, we tested a baseline implementation of psweep on HAPS-54 by utilizing a single VLX330
device. The communication between the host PC and the HAPS-54 board was realized via Ethernet.
That is, we integrated the aforementioned custom Ethernet arbiter and we employed a HAPS
daughter-board, namely GEPHY, to realize an 81 Mbps Ethernet link (actual measured bandwidth).
The FPGA resource utilization is 9346 LUT6 (5%), 8346 DFF (4%), 312 RAMB18 (54%) and 53 DSP
(28%). In practice, by using the on-board clock switches, the maximum clock frequency increased
up to 280 MHz. The correctness of the HW results was validated by comparing to the VHDL
simulator’s results (test vectors). Hence, the functionality analysis performed in sections 3.3 and
4.3 also apply to our multi-FPGA implementations. Notice that the memory utilization exceeds the
capacity of NG-MEDIUM, and therefore, we proceed to our multi-FPGA approach.
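The memory-capacity argument can be verified with simple arithmetic (our sketch; decimal Kbit is used for consistency with the 2.7 Mbit figure quoted for NG-MEDIUM in Sec. 5.3):

```python
# Quick feasibility check (our arithmetic, using figures quoted in the text):
# the single-device psweep build needs more on-chip RAM than one NG-MEDIUM
# provides, which is what forces the multi-FPGA partitioning.

KBIT = 1000   # decimal Kbit, matching the quoted ~2.7 Mbit total

used_ramb18 = 312                      # baseline VLX330 build (Sec. 5.4)
used_bits = used_ramb18 * 18 * KBIT    # each RAMB18 holds 18 Kbit

ng_medium_ramb48 = 56                  # NG-MEDIUM block RAMs (Sec. 5.3)
ng_medium_bits = ng_medium_ramb48 * 48 * KBIT   # ~2.7 Mbit total

print(f"needed: {used_bits / 1e6:.1f} Mbit, "
      f"available: {ng_medium_bits / 1e6:.1f} Mbit")
print("fits on one NG-MEDIUM:", used_bits <= ng_medium_bits)
```

The roughly 2x memory deficit explains why only partitionings that split the RAM-heavy modules across devices can become feasible.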
Fig. 12. The HAPS-54 test platform, with 4 XC5VLX330 FPGAs and daughter-boards used for multi-FPGA
evaluation (left). Four distinct partitioning schemes utilizing 2–4 FPGAs on the HAPS-54 platform (right).
Table 6. Resources of 2- and 3-FPGA implementations, with interconnection and MHz measured on-board.

Resources          |    Double-FPGA Design   |           Triple-FPGA Design
(XC5VLX330)        |  Device A  |  Device B  |  Device A  |  Device B  |  Device C
LUTs               |  2913 (1%) |  6068 (3%) |  2911 (1%) |  3052 (1%) |  2851 (1%)
DFFs               |  2632 (1%) |  6170 (3%) |  2680 (1%) |  3125 (2%) |  2665 (1%)
RAMB18s            |  109 (19%) |  203 (35%) |  108 (19%) |  174 (30%) |   27 (5%)
DSPs               |    0 (0%)  |   55 (29%) |    0 (0%)  |   10 (5%)  |   45 (23%)
Interconnections   |         59 bits         |                82 bits
Max. Frequency     |         280 MHz         |                224 MHz
Intercon. Activity |        0.08 Gbps        |                 5 Gbps
To assess our methodology, we assumed multiple partitioning scenarios and tested all result-
ing implementations. In all cases, we paid particular attention to the selection of the routing
resources for realizing the inter-FPGA network and we concluded that increasing the number of
traces/connections increases the possibility of de-synchronization (either from the start, or during
the course of processing the images). Among all successful implementations, Fig. 12, right, illustrates
4 representative partitioning topologies utilizing two, three and four FPGAs. The blue-colored
buses indicate the I/O data transfers (e.g., pixels), while the black-colored arrows correspond to
the control signals. Tables 6 and 7 report the VLX330 utilization, together with the total intercon-
nection traces of each network, the maximum clock rate measured on the board, and the total
Table 7. Resources of 4-FPGA implementations in comm-isolated and ring topologies.

Resources       |              Quadruple-FPGA Design            |            Ring Quadruple-FPGA Design
(XC5VLX330)     | Device A | Device B | Device C | Device D     | Device A | Device B | Device C | Device D
LUTs            | 2275 (1%)| 2907 (1%)| 1588 (1%)| 1329 (1%)    | 2818 (1%)| 1741 (1%)| 3482 (2%)| 1224 (1%)
DFFs            | 2560 (1%)| 3124 (2%)| 1558 (1%)| 1122 (1%)    | 2687 (1%)| 1832 (1%)| 2665 (1%)| 1237 (1%)
RAMB18s         | 109 (19%)| 176 (31%)|  24 (4%) |    0 (0%)    | 109 (19%)|  58 (10%)|  26 (5%) | 117 (20%)
DSPs            |   0 (0%) |  10 (5%) |  45 (23%)|    0 (0%)    |   0 (0%) |  10 (5%) |  45 (23%)|   0 (0%)
Interconnect.   |                  187 bits                     |                  93 bits
Max. Freq.      |                  172 MHz                      |                  224 MHz
Inter. Activity |                  22 Gbps                      |                  5.1 Gbps
information transferred among devices (interconnection activity). We note that a rough estimation
of the NG-MEDIUM utilization can be made via simple comparison to its available resources (34K
LUT4, 32K DFF, 112 DSPs, 56 RAMB48 with 2.7 Mbits).
The most balanced partitioning was achieved with the quadruple implementation in a “ring”
topology (Table 7). To achieve high clock rates in this ring, i.e., 224 MHz, we avoided common
broadcasting lines for control signals by using duplicates, and synchronized the 93 traces to sustain
a 5.1 Gbps flow on the board. This figure is far better than the excessive 22 Gbps of the first quadruple
topology (Table 7). Due to its pipelined operation, the ring’s B-C-D devices work in parallel and
sustain 99% time utilization (engaged in processing). The 3-FPGA design is also very competitive,
with very regular interconnections (Table 6). Still, it cannot fit in NG-MEDIUM due to insufficient
memory, a frequent bottleneck for image processing on FPGAs. When considering NG-MEDIUM,
the only feasible scenario for psweep is the 4-FPGA in ring topology, with a maximum utilization
of 78% RAM (in device D, for the map), 10% LUT (in C, for MNCC), and 40% DSP (in C).
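The quoted utilization percentages can be re-derived from Table 7 (our arithmetic; the LUT figure is not re-derived here because the LUT6-to-LUT4 conversion is technology-dependent):

```python
# Sanity check of the quoted NG-MEDIUM utilization of the ring partitioning,
# using the Table 7 figures (our arithmetic, not from the paper).

KBIT = 1024
NG_RAM_BITS = 56 * 48 * KBIT   # 56 RAMB48 of 48 Kbit each -> ~2.7 Mbit
NG_DSPS = 112

device_d_ram_bits = 117 * 18 * KBIT   # device D: 117 RAMB18 (Table 7, ring)
device_c_dsps = 45                    # device C: 45 DSPs (Table 7, ring)

ram_util = device_d_ram_bits / NG_RAM_BITS
dsp_util = device_c_dsps / NG_DSPS

print(f"RAM utilization on device D: {ram_util:.0%}")   # 78%, as in the text
print(f"DSP utilization on device C: {dsp_util:.0%}")   # 40%, as in the text
```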
A fair, comprehensive comparison to previous works is infeasible due to the scarcity of
similar multi-FPGA publications. Nevertheless, compared to [Choi and Rutenbar 2016], which also
uses 4 VLX330 for stereo vision, our methodology allowed a much more fine-grained partitioning
of the design and higher clock rates instead of simple function replication for mere frame-level
parallelization. Similarly, compared to [Lee et al. 2013] that uses HAPS-64 for ray-tracing, our
methodology has led to better inter-FPGA communication by respecting pin constraints and
avoiding extra HW resources (e.g., AXI-AHB bridges to interface the FPGAs), even though we achieve
up to 22 Gbps transferred among the chips. Overall, due to its demonstrated optimizations,
our multi-FPGA methodology can provide significant acceleration of stereo processing, even when
project-specific constraints mandate the use of small devices.
6 CONCLUSION
In an effort to enhance the embedded stereo vision capabilities of future planetary rovers, this
paper has presented the development and hardware acceleration of psweep, a plane sweep variant.
psweep was experimentally demonstrated to accurately reconstruct in 3D various Mars-like scenes
with mean error less than 2 cm at 4 m depth (and less than 8 mm if large errors are excluded).
Based on pixel-level pipelining, parallel architecture design, on-the-fly processing and fine-tuning
via parametric VHDL, our FPGA psweep maintains the accuracy of the SW implementation while
processing a 2.5 Mpixel stereo image in only 1.7 sec. With a cost of only 5.4K LUT and 100 RAMB36
on a xc6vlx240t-2, this FPGA achieves speed-up factors of 32x compared to a desktop CPU (Intel
core i5-4590) and 2810x against a space-grade processor (LEON3 at 50MHz). When considering
space-grade technology, our HW minimization techniques allow psweep plus an Ethernet controller
to fit in a single Virtex5QV and process a 2.5 Mpixel image in 3.1 sec. Furthermore, to leverage the
use of limited-size space FPGAs, we devised a custom methodology for multi-FPGA partitioning
and demonstrated various implementations on sets of 2−4 example devices, e.g., NG-MEDIUM
emulated on the HAPS-54 multi-FPGA platform. Our final HW/SW embedded system meets all
the demanding requirements set by ESA in terms of accuracy, speed and HW cost. Compared to
similar published works, our FPGA solution proves HW-efficient and very low-cost, hence suitable
for supporting autonomous rover navigation in planetary exploration scenarios.
ACKNOWLEDGMENTS
The authors thank Marcos Avilés Rodrigálvarez from GMV, Spain for porting the C code on LEON3,
as well as Gianfranco Visentin from ESTEC/ESA, the Netherlands for useful discussions. This work
was supported by the European Space Agency via the SEXTANT and COMPASS projects of the
ETP-MREP research programme (ESTEC refs. 4000103357/11/NL/EK and 4000111213/14/NL/PA).
REFERENCES
Kristian Ambrosch and Wilfried Kubinger. 2010. Accurate Hardware-Based Stereo Vision. Computer Vision and Image Understanding 114, 11 (2010), 1303–1316.
Max Bajracharya, Mark W. Maimone, and Daniel Helmick. 2008. Autonomy for Mars Rovers: Past, Present, and Future. Computer 41, 12 (Dec. 2008), 44–50.
Christian Banz et al. 2010. Real-Time Stereo Vision System Using Semi-Global Matching Disparity Estimation: Architecture and FPGA-Implementation. In Int’l Conf. on Embedded Comp. Sys.: Architectures, Modeling & Simulation. 93–101.
Jungwook Choi and Rob A. Rutenbar. 2016. Video-Rate Stereo Matching Using Markov Random Field TRW-S Inference on a Hybrid CPU+FPGA Computing Platform. IEEE Trans. Circuits Syst. Video Technol. 26, 2 (2016), 385–398.
Giuseppe Cocorullo, Pasquale Corsonello, Fabio Frustaci, and Stefania Perri. 2016. An Efficient Hardware-Oriented Stereo Matching Algorithm. Microprocessors and Microsystems 46 (2016), 21–33.
Robert T. Collins. 1996. A Space-Sweep Approach to True Multi-Image Matching. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’96). 358–363.
Ahmad Darabiha, W. James MacLean, and Jonathan Rose. 2006. Reconfigurable Hardware Implementation of a Phase-Correlation Stereo Algorithm. Machine Vision and Applications 17, 2 (May 2006), 116–132.
Alex Ellery. 2015. Planetary Rovers: Robotic Exploration of the Solar System. Springer Berlin Heidelberg.
David Gallup, Jan-Michael Frahm, Philippos Mordohai, Qingxiong Yang, and Marc Pollefeys. 2007. Real-Time Plane-Sweeping Stereo with Multiple Sweeping Directions. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’07). 1–8.
David Gallup, Jan-Michael Frahm, Philippos Mordohai, and Marc Pollefeys. 2008. Variable Baseline/Resolution Stereo. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’08). 1–8.
Pierre Greisen, Simon Heinzle, Markus Gross, and Andreas P. Burg. 2011. An FPGA-Based Processing Pipeline for High-Definition Stereo Video. EURASIP Journal on Image and Video Processing 2011, 1 (2011), 18.
Heiko Hirschmüller and Daniel Scharstein. 2009. Evaluation of Stereo Matching Costs on Images with Radiometric Differences. IEEE Trans. Pattern Anal. Mach. Intell. 31, 9 (2009), 1582–1599.
Minxi Jin and Tsutomu Maruyama. 2014. Fast and Accurate Stereo Vision System on FPGA. ACM Trans. Reconfigurable Technol. Syst. 7, 1 (Feb. 2014), 1–24.
Seunghun Jin et al. 2010. FPGA Design and Implementation of a Real-Time Stereo Vision System. IEEE Trans. Circuits Syst. Video Technol. 20, 1 (Jan. 2010), 15–26.
Ioannis Kostavelis et al. 2014. SPARTAN: Developing a Vision System for Future Autonomous Space Exploration Robots. J. Field Robotics 31, 1 (2014), 107–140.
Joel Le Mauff. 2018. From eFPGA cores to RHBD System-On-Chip FPGA (NanoXplore’s presentation of the NG-MEDIUM rad-hard FPGA). https://indico.esa.int/event/232/contributions/2137/attachments/1820/2121/2018-04_NX-From_eFPGA_cores_to_RHBH_SoC_FPGAs-JLM-v2.pdf. (2018). 4th SEFUW workshop, ESTEC/ESA, Noordwijk, NL, 9 April 2018.
Jaedon Lee, Youngsam Shin, Won-Jong Lee, Soojung Ryu, and Jeongwook Kim. 2013. Real-time Ray Tracing on Coarse-Grained Reconfigurable Processor. In IEEE Int’l Conf. on Field-Programmable Technology (FPT). 192–197.
George Lentaris, Dionysios Diamantopoulos, Kostas Siozios, Dimitrios Soudris, and Marcos Avilés Rodrigálvarez. 2012. Hardware Implementation of Stereo Correspondence Algorithm for the ExoMars Mission. In IEEE Int’l Conf. on Field-Programmable Logic and Applications (FPL). 667–670.
George Lentaris, Konstantinos Maragos, Ioannis Stratakos, Lazaros Papadopoulos, Odysseas Papanikolaou, Dimitrios Soudris, Manolis Lourakis, Xenophon Zabulis, David Gonzalez-Arjona, and Gianluca Furano. 2018. High Performance Embedded Computing in Space: Evaluation of Platforms for Vision-Based Navigation. J. Aerosp. Inf. Syst. 15, 4 (April 2018), 178–192.
George Lentaris, Ioannis Stamoulias, Dimitrios Soudris, and Manolis Lourakis. 2016. HW/SW Co-design and FPGA Acceleration of Visual Odometry Algorithms for Rover Navigation on Mars. IEEE Trans. Circuits Syst. Video Technol. 26, 8 (Aug. 2016), 1563–1577.
Jing Liu, Chunpeng Li, Feng Mei, and Zhaoqi Wang. 2015. 3D Entity-Based Stereo Matching With Ground Control Points and Joint Second-Order Smoothness Prior. The Visual Computer 31, 9 (Sept. 2015), 1253–1269.
Manolis Lourakis and Xenophon Zabulis. 2013. Accurate Scale Factor Estimation in 3D Reconstruction. In Intl. Conf. on Computer Analysis of Images and Patterns (CAIP). Springer Berlin Heidelberg, 498–506.
Mark W. Maimone, P. Chris Leger, and Jeffrey J. Biesiadecki. 2007. Overview of the Mars Exploration Rovers’ Autonomous Mobility and Vision Capabilities. In Int’l Conf. on Robot. Autom. (ICRA), Space robotics workshop.
Larry Matthies et al. 2007. Computer Vision on Mars. Int. J. Comput. Vision 75, 1 (2007), 67–92.
Richard H. Maurer, Martin E. Fraeman, Mark N. Martin, and David R. Roth. 2008. Harsh Environments: Space Radiation Environment, Effects, and Mitigation. Johns Hopkins APL Technical Digest 28, 1 (2008), 17–29.
Hans P. Moravec. 1977. Towards Automatic Visual Obstacle Avoidance. In Proc. Int’l Joint Conf. on AI (IJCAI). 584–594.
Mikhail G. Mozerov and Joost van de Weijer. 2015. Accurate Stereo Matching by Two-Step Energy Minimization. IEEE Trans. Image Process. 24, 3 (March 2015), 1153–1163.
Kyprianos Papadimitriou, Sotiris Thomas, and Apostolos Dollas. 2013. An FPGA-Based Real-Time System for 3D Stereo Matching, Combining Absolute Differences and Census with Aggregation and Belief Propagation. In IFIP/IEEE Int’l Conf. on VLSI-SoC. Springer, 168–187.
Madaín Pérez-Patricio and Abiel Aguilar-González. 2015. FPGA Implementation of an Efficient Similarity-Based Adaptive Window Algorithm for Real-time Stereo Matching. Journal of Real-Time Image Processing (Sept. 2015).
Marc Pollefeys et al. 2008. Detailed Real-Time Urban 3D Reconstruction from Video. Int. J. Comput. Vision 78, 2-3 (July 2008), 143–167.
Marc Pollefeys and Sudipta Sinha. 2004. Iso-Disparity Surfaces for General Stereo Configurations. In Proc. Europ. Conf. on Computer Vision (ECCV), Vol. III. Springer, 509–520.
Kyle Rupnow, Yun Liang, Yinan Li, Dongbo Min, Minh Do, and Deming Chen. 2011. High level synthesis of stereo matching: Productivity, performance, and software constraints. In Field-Programmable Technology (FPT), 2011 Int’l Conf. IEEE, 1–8.
Daniel Scharstein and Richard Szeliski. 2002. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int. J. Comput. Vision 47, 1-3 (April 2002), 7–42.
Daniel Scharstein and Richard Szeliski. 2015. Middlebury Stereo Evaluation - Version 2. http://vision.middlebury.edu/stereo/eval/. (2015). Accessed 2017-11-21.
Thomas Schöps, Torsten Sattler, Christian Häne, and Marc Pollefeys. 2017. Large-Scale Outdoor 3D Reconstruction on a Mobile Device. Computer Vision and Image Understanding 157 (2017), 151–166.
Yi Shan et al. 2014. Hardware Acceleration for an Accurate Stereo Vision System Using Mini-Census Adaptive Support Region. ACM Trans. Embed. Comput. Syst. 13, 4 (April 2014), 132:1–132:24.
Synopsys. 2017. HAPS Family of Physical Prototyping Solutions. https://www.synopsys.com/verification/prototyping/haps.html. (2017). Accessed 2017-07-10.
Matteo Tomasi, Mauricio Vanegas, Francisco Barranco, Javier Diaz, and Eduardo Ros. 2012. Real-Time Architecture for a Robust Multi-Scale Stereo Engine on FPGA. IEEE Trans. on VLSI Syst. 20, 12 (Dec. 2012), 2208–2219.
Christos Ttofis, Christos Kyrkou, and Theocharis Theocharides. 2015. A Hardware-Efficient Architecture for Accurate Real-Time Disparity Map Estimation. ACM Trans. Embed. Comput. Syst. 14, 2 (Feb. 2015), 36:1–36:26.
George Vogiatzis, Carlos Hernández Esteban, Philip H.S. Torr, and Roberto Cipolla. 2007. Multiview Stereo via Volumetric Graph-Cuts and Occlusion Robust Photo-Consistency. IEEE Trans. Pattern Anal. Mach. Intell. 29, 12 (Dec. 2007), 2241–2246.
Wenqiang Wang, Jing Yan, Ningyi Xu, Yu Wang, and Feng-Hsiung Hsu. 2015. Real-Time High-Quality Stereo Vision System in FPGA. IEEE Trans. Circuits Syst. Video Technol. 25, 10 (Oct. 2015), 1696–1708.
Qingxiong Yang et al. 2009. Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation, and Occlusion Handling. IEEE Trans. Pattern Anal. Mach. Intell. 31, 3 (March 2009), 492–504.
Qingqing Yang, Pan Ji, Dongxiao Li, Shaojun Yao, and Ming Zhang. 2014. Fast Stereo Matching Using Adaptive Guided Filtering. Image Vision Comput. 32, 3 (March 2014), 202–211.
Ruigang Yang and Marc Pollefeys. 2003. Multi-Resolution Real-Time Stereo on Commodity Graphics Hardware. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’03). 211–217.
Kuk-Jin Yoon and In So Kweon. 2006. Adaptive Support-Weight Approach for Correspondence Search. IEEE Trans. Pattern Anal. Mach. Intell. 28, 4 (April 2006), 650–656.
Xenophon Zabulis, Georgios Kordelas, Karsten Müller, and Aljoscha Smolic. 2006. Increasing the Accuracy of the Space-Sweeping Approach to Stereo Reconstruction, Using Spherical Backprojection Surfaces. In Proc. Int’l Conf. on Image Processing (ICIP). 2965–2968.
Yunlong Zhan, Yuzhang Gu, Kui Huang, Cheng Zhang, and Keli Hu. 2016. Accurate Image-Guided Stereo Matching With Efficient Matching Cost and Disparity Refinement. IEEE Trans. Circuits Syst. Video Technol. 26, 9 (Sept. 2016), 1632–1645.
Paolo Zicari, Stefania Perri, Pasquale Corsonello, and Giuseppe Cocorullo. 2012. Low-Cost FPGA Stereo Vision System for Real Time Disparity Maps Calculation. Microprocessors and Microsystems 36, 4 (2012), 281–288.