Single- and multi-FPGA Acceleration of Dense Stereo Vision
for Planetary Rovers
GEORGE LENTARIS, KONSTANTINOS MARAGOS, DIMITRIOS SOUDRIS, Department of
Electrical and Computer Engineering, National Technical University of Athens (NTUA), Greece
XENOPHON ZABULIS, MANOLIS LOURAKIS, Institute of Computer Science, Foundation for
Research and Technology – Hellas (FORTH), Greece
Increased mobile autonomy is a vital requisite for future planetary exploration rovers. Stereo vision is a
key enabling technology in this regard, as it can passively reconstruct in 3D the surroundings of a rover and
facilitate the selection of science targets and the planning of safe routes. Nonetheless, accurate dense stereo
algorithms are computationally demanding. When executed on the low-performance, radiation-hardened CPUs
typically installed on rovers, slow stereo processing severely limits the driving speed and hence the science that
can be conducted in situ. Aiming to decrease execution time while increasing the accuracy of stereo vision
embedded in future rovers, this paper proposes HW/SW co-design and acceleration on resource-constrained,
space-grade FPGAs. In a top-down approach, we develop a stereo algorithm based on the space sweep paradigm,
design its parallel HW architecture, implement it with VHDL and demonstrate feasible solutions even on
small-sized devices with our multi-FPGA partitioning methodology. To meet all cost, accuracy and speed
requirements set by the European Space Agency for this system, we customize our HW/SW co-processor by
design space exploration and testing on a Mars-like dataset. Implemented on Xilinx Virtex technology, or
European NG-MEDIUM devices, the FPGA kernel processes a 1120×1120 stereo pair in 1.7−3.1 sec, utilizing only 5.4−9.3K LUT6 and 200−312 RAMB18. The proposed system exhibits up to 32x speedup over desktop
CPUs, or 2810x over space-grade LEON3, and achieves a mean reconstruction error less than 2 cm up to 4 m
depth. Excluding errors exceeding 2 cm (which are less than 4% of the total), the mean error is under 8 mm.
Additional Key Words and Phrases: Planetary rovers, autonomous navigation, stereo vision, space sweep,
parallel architecture design, rad-hard FPGA, multi-FPGA partitioning
ACM Reference Format:
George Lentaris, Konstantinos Maragos, Dimitrios Soudris, Xenophon Zabulis, and Manolis Lourakis. 2019. Single- and multi-FPGA Acceleration of Dense Stereo Vision for Planetary Rovers. ACM Trans. Embedd. Comput. Syst. 18, 2, Article 16 (April 2019), 25 pages. https://doi.org/10.1145/3312743
1 INTRODUCTION
As confirmed by recent and planned activities, the exploration of planetary bodies, in particular
of Mars, is a priority for all major space agencies. Existing scenarios emphasize a high mobility
autonomous rover, which will perform both in situ exploration and sample-return missions to
further analyze the collected material back on Earth [Bajracharya et al. 2008; Ellery 2015]. Increasing
the navigational autonomy of planetary exploration rovers allows them to explore larger areas
Authors’ addresses: George Lentaris, Konstantinos Maragos, Dimitrios Soudris, Department of Electrical and Computer
Engineering, National Technical University of Athens (NTUA), Zografou, Athens, 15788, Greece, [email protected];
Xenophon Zabulis, Manolis Lourakis, Institute of Computer Science, Foundation for Research and Technology – Hellas
(FORTH), N. Plastira 100, Heraklion, 70013, Greece, [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
1539-9087/2019/4-ART16 $15.00
https://doi.org/10.1145/3312743
ACM Transactions on Embedded Computing Systems, Vol. 18, No. 2, Article 16. Publication date: April 2019.
16:2 Lentaris, G. et al.
during a mission and amass more scientific data. Embedded computer vision is a compelling
technology in this context, due to its reliance on passive cameras with reduced power consumption,
low cost, small size, increased mechanical reliability and longevity [Matthies et al. 2007]. Stereo
vision, in particular, permits a rover to reconstruct its environment in 3D so as to survey it, detect
obstacles or science targets and safely steer itself through unknown, potentially hazardous terrain.
In its archetypal form, binocular stereo involves two images acquired from slightly different
viewpoints and a computational procedure for determining a disparity map, i.e., an association
among matching pixels of the image pair [Scharstein and Szeliski 2002]. A triangulation calculation
determines the depth of a physical point from its disparity, i.e., the distance between its homologous
projections on the two images. Owing to the epipolar constraint which restricts corresponding
pixels to lie on conjugate epipolar lines, stereo matching amounts to a 1D search problem. Even so,
stereo matching is computationally intensive and requires a significant amount of time, especially
when it concerns images of high resolution.
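Concretely, for a rectified pair with focal length f (in pixels) and baseline b, the triangulation reduces to Z = f·b/d for disparity d. The sketch below is an illustrative Python model, with numbers close to, but not quoted from, the rover setup described later in this paper (the actual implementation is ANSI C/VHDL):

```python
def depth_from_disparity(d_pixels, f_pixels, baseline_m):
    """Depth of a scene point from its disparity in a rectified stereo pair."""
    if d_pixels <= 0:
        raise ValueError("disparity must be positive")
    return f_pixels * baseline_m / d_pixels

# Illustrative numbers: a 6.6 mm lens with 5.5 um pixels and a 20 cm
# baseline; the focal length in pixels is focal[m] / pixel_pitch[m].
f_px = 6.6e-3 / 5.5e-6                            # = 1200 pixels
z_far = depth_from_disparity(60.0, f_px, 0.20)    # ~4.0 m
z_step = z_far - depth_from_disparity(61.0, f_px, 0.20)  # depth change per disparity step
```

With these numbers, a one-disparity step near 4 m changes depth by roughly 6.6 cm, while the same step near 1 m changes it by only about 4 mm, which is why stereo accuracy degrades quadratically with distance.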
To withstand extreme temperature ranges and space radiation effects [Maurer et al. 2008],
planetary rovers have to utilize space-grade processors, which are 1−2 orders of magnitude slower
than equivalent terrestrial CPUs. The primary reason for this is that due to their high production
and certification costs, space-qualified electronics lag behind commercial technologies by several
fabrication process generations. For example, the MER (Spirit/Opportunity) and MSL (Curiosity) Mars missions [Ellery 2015] respectively use the RAD6000 @20MHz and RAD750 @133MHz rad-hard
CPUs, whose computing capabilities are similar to the PowerPC 601 and PowerPC 750 from
the 1990s. Both missions rely on stereo as a rover’s sole 3D sensing mechanism and schedule
stationary intervals within the rover traverse paths in order to perform the CPU-intensive stereo
processing [Bajracharya et al. 2008; Matthies et al. 2007]. Executing simultaneously with flight
software, a 256×256 map of 32 disparities requires about 30 sec to complete on MER rovers [Maimone
et al. 2007; Matthies et al. 2007]. As a result of slow processing, the rovers are controlled through a
prolonged command-execute-telemetry-deploy cycle, which scientists wish to shorten [Ellery 2015].
Looking ahead, increased-accuracy algorithms such as the one proposed in this paper will
increase the stereo workload by three orders of magnitude due to high-definition images, more
depths/disparities examined and more sophisticated matching metrics.
The aforementioned workload poses a great challenge to the task of improving a rover’s stereo
vision accuracy and execution speed for the eventual goal of advancing its autonomy. Practically,
this goal cannot be realized via rad-hard CPUs alone and mandates the use of hardware accelerators.
As demonstrated by our recent comparative analysis involving a large number of diverse processing
platforms as candidate accelerators for space avionics [Lentaris et al. 2018], Field Programmable Gate
Arrays (FPGAs) offer the highest performance per watt among all choices. Hence, they constitute
the most viable option for accelerating computationally demanding operations in space under a
limited power budget. High-density chips, such as Xilinx Virtex-5QV or Microsemi ProASIC3, are
already used in space and can serve as a reliable basis for future HW/SW co-design solutions in
rover navigation. Guided by the European Space Agency (ESA), in this paper we study, design, and
implement a HW/SW embedded system tailored to stereo vision on Martian rovers by utilizing a
rad-hard CPU and single or multiple rad-hard by design FPGAs of various technologies.
At platform level, we assume a generic rover setup [Kostavelis et al. 2014] with two navigation
stereo cameras mounted on a pan-tilt unit (PTU) atop a mast at a height of 1 m. The stereo cameras
are placed 20 cm apart (baseline distance), each with a resolution of 1120×1120 elements (5.5 µm
square pixels), 6.6 mm focal length and 50° field of view. They are parallel to each other, form an
angle of 39° with the horizon and capture a volume from 0.48 m to 4 m on level ground in front of
the rover. The PTU can pan the cameras in three directions (−35°, 0°, 35°) to acquire three stereo
pairs, jointly covering a field of 120° with 25% overlap. As per the requirements of ESA, such a
triple acquisition will take place with the rover being stationary and must complete within 20 sec,
including the time for stereo processing, reorienting the cameras and merging partial results.
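As a consistency check on these figures, the field of view follows directly from the sensor geometry: FOV = 2·atan(N·p/(2f)). A minimal Python sketch (ours, for illustration only):

```python
import math

def fov_deg(n_pixels, pixel_pitch_m, focal_m):
    """Full field of view along one image axis of a pinhole camera."""
    half_sensor = 0.5 * n_pixels * pixel_pitch_m
    return 2.0 * math.degrees(math.atan(half_sensor / focal_m))

fov = fov_deg(1120, 5.5e-6, 6.6e-3)   # ~50 degrees, as specified
span = 2 * 35.0 + fov                 # three pans at -35/0/+35 degrees
```

With the stated 5.5 µm pixels and 6.6 mm lens this gives approximately 50.0°, and three pans at −35°/0°/35° indeed jointly span 2·35° + 50° ≈ 120°.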
Besides the aforementioned 20 sec budget, the requirements set by ESA for the embedded system
also concern accuracy and HW cost. Specifically, the average reconstruction error for the cloud
of output points is required to be less than 2 cm up to 4 m depth (reflecting the fact that stereo is
inherently less accurate for distant targets). This accuracy is to be verified with a synthetic dataset
depicting a Mars-like environment with diffuse lighting, low contrast and a mixture of fine grained
sand, rock outcrops and surface rocks (cf. Fig. 2). In terms of HW cost, the solution should use a
rad-hard LEON processor with 150 MIPS processing power and FPGA(s) of space-grade technology;
the possibility of using one Xilinx Virtex-5QV (82K LUTs, 320 DSPs, 298 RAMB36) or multiple
smaller FPGAs of European origin (e.g., the NanoXplore NG-MEDIUM [Le Mauff 2018], with 34K
LUT4, 112 DSPs, 56 RAMB48) should also be examined.
The contributions of this paper are i) to develop and evaluate a space sweep stereo algorithm
customized to Martian scenarios, ii) to describe the first, to the best of our knowledge, FPGA
implementation of space sweep stereo in the literature, iii) to quantify the benefits of using FPGA
in embedded systems for future rovers by performing HW/SW co-design and developing parallel
HW architectures that achieve 1000x speed-up on space-grade chips and iv) to devise a multi-FPGA
partitioning methodology and demonstrate high-performance results even when employing small
FPGAs. Regarding our algorithmic and HW acceleration innovations, primarily we i) combine
sweeping planes at exponentially increasing distances from the cameras with the modified nor-
malized cross-correlation metric for improved depth accuracy, ii) combine deep-pipelining and
on-the-fly computation with highly-utilized HW reuse instead of trivial pixel/depth parallelization,
iii) rely on parametric VHDL and design space exploration for fine tuning, iv) explore and combine
low-level synchronization techniques for fast multi-FPGA communication. Consequently, the final
solution meets all ESA requirements by exploiting the trade-off between circuit size and execution
time. Instead of maximizing HW parallelization as often done in terrestrial applications with relaxed
HW constraints, our FPGA design employs only 5.4K LUTs to process a stereo pair with 2.5 Mpixels
in less than 3.1 sec, thus allowing a triplet of reconstructions and related tasks (PTU movement,
concatenation, etc) to be completed in the allotted 20 sec with limited HW resources.
Proceeding from the algorithmic to the implementation level, the remainder of the paper is
organized as follows. Section 2 reviews relevant published works. Section 3 presents the algorithmic
details and a SW evaluation on Mars-like datasets. Section 4 presents the proposed HW architecture
and the results of our FPGA implementation. Section 5 discusses our multi-FPGA partitioning
methodology and evaluates the results on space-grade devices. Section 6 concludes the paper.
2 RELATED WORK
2.1 Binocular Stereo Algorithms
According to the comprehensive taxonomy of [Scharstein and Szeliski 2002], binocular stereo
algorithms are comprised of a combination of the following steps: a) computation of matching cost,
b) aggregation of support, c) disparity computation and d) disparity refinement. Applied to pixel
windows, the matching cost measures the affinity of two images using criteria such as the sum of
squared/absolute differences (SSD/SAD), the normalized cross correlation (NCC) or the modified
NCC (MNCC) [Hirschmüller and Scharstein 2009; Moravec 1977]. Aggregation of support refers
to whether matching scores are combined locally or globally. Local methods aggregate cost by
summing within finite support regions. Thus, they emphasize the cost computation and aggregation
steps and often estimate disparities with a greedy “winner-take-all” strategy which identifies the
minimum cost disparity in each support region. On the other hand, global methods perform most
computations during the disparity computation step and typically lack any aggregation step. The
disparity computation step disambiguates potential matches by solving an optimization problem
under smoothness constraints. Preserving depth discontinuities is one of the major challenges
that needs to be overcome during disparity computation. Refinement of disparities concerns post-
processing operations such as sub-pixel interpolation, identification of occlusions, elimination
of spurious matches or filling of holes at textureless regions. The 3D scene structure is finally
recovered by triangulating matching viewing rays, with metric scale obtained from the knowledge
of the extrinsic calibration parameters [Lourakis and Zabulis 2013].
Compared to local ones, global stereo techniques generally produce more accurate results.
Nevertheless, the inaccuracies of local methods are mainly exhibited at depth discontinuities or
weakly textured regions and have a limited impact on the overall shape of the reconstructed surface.
As concerns their computational requirements, global methods need large amounts of memory
and have irregular data access patterns. On the other hand, local methods are less computationally
intensive and simpler to implement, and are therefore usually preferred for time-constrained
applications. Besides, their extensive data-level parallelism, smaller memory footprint and
restricted data access pattern make local stereo methods better suited for parallelization on special
computing platforms such as FPGAs and Digital Signal Processors (DSPs).
Plane sweep is a local stereo algorithm that performs multi-image stereo matching with arbitrary
relative camera configurations [Collins 1996]. Originally targeted at the reconstruction of sparse
features, it was later adapted to dense depth estimation [Yang and Pollefeys 2003]. It works by
sweeping a hypothetical plane through a scene and measuring the photoconsistency of input images
as they are backprojected onto the swept planes, without prior rectification. Image projections
of a scene point at a certain depth should be highly correlated when backprojected on the plane
swept through the particular depth corresponding to their pre-image (Fig. 1, left). Plane sweep is
conceptually simple and amenable to highly parallel hardware implementation that can achieve
superior performance, hence it has been often employed for efficiently reconstructing outdoor
environments [Pollefeys et al. 2008; Schöps et al. 2017]. The accuracy of plane sweep can be
improved by choosing the sweeping direction according to the orientation of dominant structures
in a scene, e.g. the ground plane [Gallup et al. 2007]. Its computational complexity can be directly
modulated with respect to the precision of the depth map, both regarding pixel resolution as well
as depth precision. In this manner, it is possible to dedicate less time for a coarser reconstruction of
a scene, and thus obtain an anytime algorithm. This behavior is not trivially feasible with other
local stereo approaches. Owing to the above reasons, plane sweep is the approach adopted in this
work for realizing planetary rover stereo vision.
2.2 FPGA Implementations of Stereo Algorithms
The literature provides several FPGA implementations of stereo vision algorithms. In a few cases,
automatic HLS was employed to increase productivity at the expense of throughput [Rupnow et al.
2011]; however, for HW efficiency purposes, the majority of works rely on manual VHDL/RTL
coding. Earlier approaches [Ambrosch and Kubinger 2010; Banz et al. 2010; Jin and Maruyama 2014;
Jin et al. 2010; Tomasi et al. 2012] generally concern images of VGA (i.e., 640 × 480) resolution or
lower, and process a few hundred frames per second (FPS) with limited disparity search ranges (e.g.,
230 FPS for 64 disparities in [Jin et al. 2010]). Later works such as [Greisen et al. 2011; Lentaris et al.
2012; Papadimitriou et al. 2013; Wang et al. 2015; Zicari et al. 2012] deal with megapixel images.
In particular, [Lentaris et al. 2012] accelerates an ordinary, fixed-window stereo algorithm on
a Xilinx Virtex6 FPGA as a proof-of-concept for use by future planetary rovers. This algorithm,
hereinafter referred to as gad, computes two disparity maps (used for left-right consistency cross-
checking) by using Gauss-weighted cost aggregation of absolute pixel differences in 7×7 windows to
examine 200 disparities in 1120×1120 stereo images; subpixel disparities are computed with parabola
fitting. Aside from the Gaussian weighting and the different image resolutions and disparity levels
considered, gad is very similar to the stereo algorithm on-board the MER rovers [Matthies et al.
2007]. gad’s architecture relies on pixel-level pipelining and extensive resource reuse of on-chip
memory to consume only 3K LUTs and 101 RAMB36 while processing an image in 2.3 sec. In [Wang
et al. 2015], Altera Stratix FPGAs are used to accelerate a stereo algorithm employing AD-Census
cost computation, cross-based cost aggregation and semi-global optimization. The final system
processes up to 1600×1200 pixel images with 128 disparity levels at 42 FPS. However, it requires up
to 222K ALUTs and 16.6 Mbit RAM. [Tomasi et al. 2012] utilize a Xilinx Virtex4 FPGA to rectify
and process stereo images of VGA resolution at 57 FPS. The implementation uses a fine pipelined
method and multiple user defined parameters, which result in a total of 58K LUTs, 131 DSPs, and
100 RAMB18. Disparity computation is accelerated in [Greisen et al. 2011] with 54K LUTs (or
100K LUTs for the entire stereo pipeline) on Altera Stratix3 and processes 1080p images with 256
disparities at 30 FPS. [Papadimitriou et al. 2013] accelerate stereo matching with 38K LUTs on
Xilinx Virtex5 and process 1920×1200 images with 64 disparities at 87 FPS. [Pérez-Patricio and
Aguilar-González 2015] implement an adaptive window algorithm on an Altera DE2 board and
achieve 76 FPS for 1280 × 1024 images and 15 disparities.
Regarding multi-FPGA implementations, the relevant literature is very limited. [Darabiha et al.
2006] partition a phase-based stereo algorithm on a board containing four Xilinx Virtex 2000E
FPGAs and obtain dense disparity maps at a speed of 30 FPS for 256×360 images. The work of [Choi
and Rutenbar 2016] concerns a custom Markov Random Field inference system for stereo matching,
which relies on sequential tree-reweighted message passing and is accelerated on a Convey HC-1
platform, i.e., four Xilinx Virtex-5 V5LX330 FPGAs. [Lee et al. 2013] utilize two stacked Synopsys
HAPS-64 multi-FPGA boards for accelerating a mobile ray tracing architecture partitioned to eight
Virtex-6 LX760 FPGA chips.
In summary, a review of the literature reveals that FPGA stereo implementations for high-
definition images with increased depth accuracy (e.g., 1 Mpixel and at least 100 disparities), achieve
high FPS at the cost of increased FPGA resources, i.e. 38−222K LUTs. HW cost of such magnitudes
would over-utilize the limited size chips targeted in the current paper. In contrast, our work
concerns embedded systems with limited-capacity, space-grade HW and must attain the right
trade-off between FPGA resource utilization and speed sufficiency. Thus, given our goal of maximal
HW efficiency (i.e. throughput over cost), our approach is to trade one order of magnitude FPGA
resources for lower FPS, while maintaining conformance to ESA’s speed and accuracy requirements.
3 THE PLANE SWEEP ALGORITHM
3.1 Design and Software Development
Details of the stereo algorithm we developed, hereafter denoted as psweep, are given in this section.
psweep uses a plane as its sweeping surface, all instances of which are parallel to each other. They
are parameterized by their distance from a reference center O, typically chosen as the “cyclopean
eye”, placed midway between the two camera centers. psweep operates by sampling the swept
plane into cells (pixels) and translating it through space along a sweeping direction. Typically,
the swept plane is oriented to be approximately parallel to the two image planes, but this is not
mandatory if a justifiable prior can be presumed [Gallup et al. 2007].
The swept plane creates a family of parallel planes, denoted Wi. At each posture i of the sweeping
plane, the images of the input stereo pair, I1 and I2, are backprojected upon plane Wi, forming
images C1i and C2i; for reference in Sec. 4, this operation is called image projection. By means of
this projection, each point of Wi associates two viewing rays that intersect upon it. Fig. 1 (left)
Fig. 1. Left: Plane sweep illustration. Images acquired by the two cameras (red, green) are backprojected
upon hypothetical sweeping planes extending from the cyclopean eye (gray). Points A and B lie on such a
plane and are respectively tangent and non-tangent upon a physical surface. Middle: sample depth map in
pseudo-color obtained with the psweep algorithm for the left stereo pair of Fig. 2. Right: Photoconsistency
patterns as a function of depth. Textured surfaces give rise to clear local photoconsistency maxima (green).
Surfaces that lack texture may yield high photoconsistency values, but do not give rise to local maxima (red).
shows two such pairs of associated rays. The corresponding pixels in C1i and C2i are expected to be
photoconsistent if the intersection point of the ray withWi is tangent to a physical surface, because
they image the same real point (i.e. point A in the figure). In contrast, the pixels corresponding to
rays that intersect at a non-tangent point, i.e. B, are not expected to be photoconsistent because
they image different physical points. This observation constitutes the reconstruction principle of
plane sweep and determines a 3D point at the distance that photoconsistency is maximized, for each
viewing ray from the cyclopean eye. Subsequent adaptations of this idea have used more efficient
point sampling on the sweeping plane [Gallup et al. 2008], considered non-planar shapes for the
sweeping surface [Pollefeys and Sinha 2004; Zabulis et al. 2006], or exploited prior knowledge on
the location and orientation of the imaged structures [Gallup et al. 2007].
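For planar sweeping surfaces, the backprojection used in the image projection step can be expressed with the standard plane-induced homography from multi-view geometry; we state this textbook identity for reference, as the paper itself does not write the warp out. For a plane with unit normal n at distance d_i from the reference camera (intrinsics K_1), and a second camera with intrinsics K_2 and relative pose (R, t):

```latex
% Plane-induced homography mapping reference-image pixels to the second
% image via the plane n^T X = d_i (the sign of the t n^T term depends on
% the plane and pose conventions adopted):
H_i \;=\; K_2 \left( R \,+\, \frac{t\, n^{\top}}{d_i} \right) K_1^{-1}
```

Warping the two input images through the homographies induced by each swept plane yields the backprojected images whose per-pixel photoconsistency the algorithm then evaluates.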
Most flavors of plane sweep assess photoconsistency using support windows (i.e., image patches)
centered at the points of interest in C1i and C2i. Grayscale intensities in two patches are compared
using metrics like SAD, SSD, NCC and MNCC [Moravec 1977]. NCC and MNCC are routinely preferred
as they can withstand gain and bias changes, hence do not require the employed cameras to be
radiometrically calibrated. For two patches P1 and P2, NCC is defined as cov(P1, P2)/√(σ²(P1) σ²(P2)),
and MNCC as 2 cov(P1, P2)/(σ²(P1) + σ²(P2)), or, in an extended form,

MNCC(P_1, P_2) = \frac{2\,\bigl(n \sum_k P_1^k P_2^k - \sum_k P_1^k \sum_l P_2^l\bigr)}
                      {n \sum_k (P_1^k)^2 - \bigl(\sum_k P_1^k\bigr)^2 + n \sum_l (P_2^l)^2 - \bigl(\sum_l P_2^l\bigr)^2},    (1)

where P_1^k and P_2^l denote pixels k and l from patches P1 and P2, both of which consist of n
pixels over which the sums run. The advantages of MNCC over standard NCC are that it is faster
to compute and that it tends to zero when there is a significant difference in variance between the
patches P1 and P2, while closely approximating NCC for equivariant patches.
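Eq. (1) can be evaluated from five patch sums (ΣP1, ΣP2, ΣP1², ΣP2², ΣP1P2), which is what makes it attractive for incremental computation. A reference model (illustrative Python; the paper's SW baseline is ANSI C, and the small eps guard is our addition to avoid division by zero on constant patches):

```python
def mncc(p1, p2, eps=1e-12):
    """Modified NCC of two equally sized grayscale patches, per Eq. (1)."""
    assert len(p1) == len(p2)
    n = len(p1)
    s1, s2 = sum(p1), sum(p2)
    s11 = sum(a * a for a in p1)
    s22 = sum(b * b for b in p2)
    s12 = sum(a * b for a, b in zip(p1, p2))
    num = 2.0 * (n * s12 - s1 * s2)
    den = n * s11 - s1 * s1 + n * s22 - s2 * s2
    return num / (den + eps)   # -> 0 when both patches are textureless

# Identical textured patches are perfectly photoconsistent:
score = mncc([1, 5, 3, 7], [1, 5, 3, 7])          # ~1.0
# A pure bias change leaves the score unaffected:
score_bias = mncc([1, 5, 3, 7], [1.1, 5.1, 3.1, 7.1])
```

Note how a pure bias change leaves the score at 1 for identical underlying texture, a large gain difference pulls it toward 0 (the variance-mismatch behavior noted above), and two constant (textureless) patches score 0 rather than a spurious 1.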
The photoconsistency values for each cell of plane Wi and depth d may be used to form a
“photoconsistency image” Mi, which is “stacked” in a 3D matrix M(x, y, d). To quantify photoconsistency,
psweep adopts the MNCC metric because it exhibits increased robustness to textureless image
regions while providing reasonable performance at a low computational cost. The MNCC is
computed in square image patches with an acceptance threshold of 0.85; this is referred to as
MNCC calculation in Sec. 4. Image patches for MNCC are 13 × 13 pixels, with similar results obtained
using slightly smaller or larger patch sizes. psweep’s implementation avoids recomputing intermediate
Fig. 2. Sample stereo pairs from the “rocky” (left) and “flat” (right) terrain types employed for evaluation.
The “flat” dataset contains weaker texture.
terms of the MNCC cost function, such as the sum of pixels within its support window or the
sum of their squares for each pixel. This is achieved by computing these intermediate quantities
through a convolution with a constant kernel of the same size as the integration window. Such a
convolution can be highly optimized on FPGAs. At each pixel, psweep computes depth using a
winner-take-all policy followed by interpolation via parabola fitting to improve resolution. The
combination of these two steps is referred to as map updating in Sec. 4. Fig. 1 (middle) shows the
depth map obtained with psweep for the left stereo pair of Fig. 2. 3D points are recovered from
their image coordinates and computed depths; this is called reconstruction in Sec. 4.
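The constant-kernel convolution that produces those per-pixel window sums is separable and can be realized with running sums, costing one add and one subtract per output regardless of the 13×13 window size, which is precisely what makes it FPGA-friendly. A 1-D illustration (our sketch, assuming the input is at least as long as the window):

```python
def sliding_sum(vals, w):
    """Convolution with a constant kernel of width w via a running sum
    (one add + one subtract per output), shown in 1-D; a 2-D box filter
    applies this along rows and then along columns."""
    out = []
    s = sum(vals[:w])          # first window computed directly
    out.append(s)
    for i in range(w, len(vals)):
        s += vals[i] - vals[i - w]   # slide the window by one pixel
        out.append(s)
    return out

sums = sliding_sum([1, 2, 3, 4, 5], 3)   # [6, 9, 12]
```

Applying `sliding_sum` along rows and then along columns of an image yields the 2-D box sums needed for every term of Eq. (1).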
A crucial difference of psweep from ordinary plane sweep variants is that it requires that the
optimum giving rise to a reconstructed point is local. That is, the photoconsistency values at
both the preceding and succeeding distances should be suboptimal [Vogiatzis et al. 2007]. The
reason is that textureless regions yield spuriously high maximal photoconsistency values and
produce a photoconsistency pattern that converges asymptotically to the maximum possible
photoconsistency value (see Fig. 1, right). As in [Collins 1996], the entire volume of photoconsistency
data is not retained in memory. As the sweeping plane proceeds, a 2D buffer marks for each pixel
the posture that has yielded the best score up to that instant; when sweeping concludes, it contains
the estimated depth for each pixel. To facilitate the determination of whether a particular value
is a local optimum, this buffer has three cyclically updated layers which store the preceding and
succeeding photoconsistency values.
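Per pixel, the three layers can be modeled as the (previous, current, next) photoconsistency samples: a depth is committed only when the middle sample is a strict local maximum above the acceptance threshold, and the same three samples feed the parabola-fit refinement. An illustrative Python model (names and the uniform-step refinement are our simplifications; psweep streams planes rather than storing whole score profiles):

```python
def best_local_peak(scores, depths, thresh=0.85):
    """Winner-take-all over one pixel's photoconsistency profile,
    accepting only strict local maxima (cf. Fig. 1, right), with
    sub-plane refinement by parabola fitting."""
    best = None
    for i in range(1, len(scores) - 1):
        prev, cur, nxt = scores[i - 1], scores[i], scores[i + 1]
        if cur >= thresh and cur > prev and cur > nxt:
            if best is None or cur > scores[best]:
                best = i
    if best is None:
        return None                      # hole: no reliable depth
    prev, cur, nxt = scores[best - 1], scores[best], scores[best + 1]
    # Parabola through the three samples; offset lies in (-0.5, 0.5) planes.
    offset = 0.5 * (prev - nxt) / (prev - 2.0 * cur + nxt)
    step = depths[best + 1] - depths[best]    # local plane spacing
    return depths[best] + offset * step

d = best_local_peak([0.2, 0.9, 0.95, 0.9, 0.99], [1.0, 1.2, 1.4, 1.6, 1.8])
```

A monotonically rising profile, as produced by textureless regions (Fig. 1, right), never yields a strict interior maximum and is therefore rejected as a hole.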
For simplicity, the spacing of the sweeping planes could be kept uniform. In the current work, a
more efficient parameterization is realized by choosing these planes to be exponentially sparser
with distance [Pollefeys et al. 2008], accounting for image discretization (see Fig. 1, left).
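One common way to realize such spacing, which we sketch here as an assumption rather than the paper's exact parameterization, is a geometric progression of plane distances between the near and far limits of the observed volume:

```python
def sweep_depths(z_near, z_far, n_planes):
    """Plane distances growing exponentially with depth, so planes are
    dense near the cameras and sparse far away (cf. Sec. 3.1)."""
    ratio = (z_far / z_near) ** (1.0 / (n_planes - 1))
    return [z_near * ratio ** i for i in range(n_planes)]

planes = sweep_depths(0.48, 4.0, 64)   # covers the rover's 0.48-4 m volume
```

With 64 planes over 0.48–4 m the spacing grows from roughly 1.6 cm near the rover to about 13 cm at the far limit, mirroring the quadratic loss of disparity resolution with depth.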
3.2 Dataset, Metrics and Software Evaluation Setup
To evaluate the accuracy performance of stereo reconstruction, a binocular dataset with known
depth ground truth was synthetically generated. More specifically, a total of 68 stereo images
corresponding to various locations along two simulated rover trajectories were graphically rendered
and their depth maps retained. These images are of size 1120 × 1120 pixels and represent two
different terrain typologies (namely “rocky” and “flat”) in a Mars-like environment; sample frames
are shown in Fig. 2. To gauge stereo performance, we define the following metrics:
M1: Root mean square (RMS) error for reconstructed and ground truth depths whose difference is
at most 2 cm. This is the primary metric, intended to measure the stereo accuracy excluding
large errors.
Table 1. Accuracy metrics for the “rocky” and “flat” Mars-like datasets.
                                   rocky            flat
Metric                         psweep    gad    psweep    gad
RMS (M1, mm) 7.50 5.33 7.86 4.95
Large errs fraction (M2.a, %) 2.54 4.60 3.94 3.25
Large errs RMS (M2.b, mm) 102 1065 124 1065
Coverage ratio (M3, %) 98.57 98.25 84.44 98.33
Mean hole size (M4, pix) 87.96 371.41 101.99 174.59
Consolidated RMS (mm) 17.87 228.53 25.88 192.12
M2.a: Fraction of points for which the depth error is more than 2 cm. This metric is meant to
measure the frequency of points with large errors. Metric M2.b corresponds to the RMS error
for such erroneous points.
M3: Coverage fraction measuring the ratio of points that are reconstructed over those that could
be potentially reconstructed. This metric depends on scene texture and certain algorithmic
parameters, e.g. the MNCC threshold.
M4: Mean hole size defined as the ratio of not reconstructed (i.e. hole) pixels over the number of
connected components corresponding to hole pixels. This metric indicates the average size
of image patches for which no depth measurement is available.
Quality metrics similar to M1 and M2.a above are used in [Scharstein and Szeliski 2002], albeit
defined on the disparities rather than the depths.
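For concreteness, M1–M3 can be computed directly from corresponding estimated and ground-truth depth maps; the following reference model (illustrative Python; hole pixels encoded as None, a representation we chose for clarity) mirrors the definitions above:

```python
import math

def stereo_metrics(est, gt, tol=0.02):
    """M1: RMS of errors <= tol; M2a: fraction of errors > tol;
    M2b: RMS of those large errors; M3: coverage of valid pixels.
    `est` entries are None where no depth was reconstructed."""
    small, large, holes = [], [], 0
    for e, g in zip(est, gt):
        if e is None:
            holes += 1
            continue
        err = abs(e - g)
        (small if err <= tol else large).append(err)

    def rms(v):
        return math.sqrt(sum(x * x for x in v) / len(v)) if v else 0.0

    n_rec = len(small) + len(large)
    return {"M1": rms(small),
            "M2a": len(large) / n_rec if n_rec else 0.0,
            "M2b": rms(large),
            "M3": n_rec / len(gt)}

m = stereo_metrics([1.00, 1.01, None, 1.50], [1.00, 1.00, 1.20, 1.40])
```

M4 additionally requires connected-component labeling of the hole mask, which we omit here.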
3.3 Performance Results
First, psweep was implemented and substantially optimized in ANSI C. The depths obtained from
this purely SW implementation were used as a baseline against which potential performance
degradations due to simplifications in the FPGA implementation were compared. Furthermore,
to analyze accuracy, the estimated depths as well as those computed with 13 × 13 masks by the
implementation of gad from [Lentaris et al. 2012], were compared with the true depths using the
metrics presented in Section 3.2. For both stereo algorithms, Table 1 lists the values computed for the
various metrics from the “rocky” and “flat” datasets. It also includes the consolidated RMS error for
all reconstructed points, regardless of magnitude. The RMS error expressed by metric M1 is slightly
better for gad (around 5.0 mm versus 7.5 mm for psweep). This difference is nevertheless small and
can be attributed to the idiosyncrasies of the dataset marginally favoring one algorithm over the
other. M2.a, the fraction of points reconstructed with errors exceeding 2 cm is also comparable for
both algorithms and in all cases less than 5%. However, as can be seen from the third row of Table 1
which provides M2.b, the RMS error for points with errors larger than 2 cm, points reconstructed
with gad have errors which are one order of magnitude larger compared to the errors for the points
obtained with psweep. This is because psweep generates fewer spurious reconstructed points due
to its strategy of selecting local optima in photoconsistency (cf. Sec. 3.1 and Fig. 1, right). Thus,
psweep has better overall accuracy, as can be confirmed by its significantly lower consolidated
RMS. The coverage ratio metric M3 well exceeds 80% for both datasets. M4, the average area of
holes is around 100 pixels for psweep and is smaller than that for gad.
To facilitate a better understanding of psweep stereo reconstruction errors, Fig. 3 provides two
histogram plots. Specifically, for the points reconstructed from each dataset, these plots illustrate
their total number and mean reconstruction error against the distance of reconstructed points from
the cyclopean eye. As can be clearly seen in the left plot, the majority of reconstructed points are
located near the stereo system. This is expected since, compared to distant ones, closer areas
ACM Transactions on Embedded Computing Systems, Vol. 18, No. 2, Article 16. Publication date: April 2019.

are imaged in greater detail, i.e., with a larger pixel count. The right plot in Fig. 3 shows that
reconstruction errors are larger for points further away from the stereo system, demonstrating the
well-known fact that the stereo measurement error is proportional to the square of depth. Due to
the synthetic texture’s granularity not being sufficiently high, very close surfaces are rendered with
low spatial frequencies and hence give rise to errors in stereo matching. This explains why the
errors at the first few closest distances are larger than those at the immediately farther distances.
Both plots in Fig. 3 start at 1 m, which is the minimum distance psweep is applied at, and extend up
to 4 m according to ESA’s requirements. The mean error in the right plot remains less than 1.6 cm.
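The quadratic error growth can be made explicit from the rectified-stereo relation d = f b/Z used later in Sec. 3.3; the derivation below is a standard first-order argument, not taken verbatim from the paper:

```latex
Z = \frac{f\,b}{d}
\quad\Rightarrow\quad
\left|\frac{\partial Z}{\partial d}\right| = \frac{f\,b}{d^{2}} = \frac{Z^{2}}{f\,b}
\quad\Rightarrow\quad
\delta Z \approx \frac{Z^{2}}{f\,b}\,\delta d ,
```

so a fixed matching error δd in disparity translates to a depth error that grows with the square of the depth Z.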
We have also evaluated the performance of psweep and gad using version 2 of the Middlebury
stereo benchmark [Scharstein and Szeliski 2015]. Since the notion of stereoscopic disparity is central
to conventional stereo, the Middlebury benchmark employs disparities both to encode ground
truth and also to define performance metrics. Based on a set of test stereo images and their true
disparities, the benchmark involves three error metrics (nonocc, all, disc) defined as percentages of
erroneously matched pixels in certain image regions. The metrics are computed for every disparity
field estimated by a stereo algorithm being evaluated and used to rank this algorithm according
to each error metric and disparity field tested. The algorithm’s overall rank is then obtained from
the average of its ranks for each error metric and test image [Scharstein and Szeliski 2002]. We
note at this point that psweep does not compute disparities but rather 3D depths directly. Thus, in
order to facilitate the evaluation of psweep with the Middlebury benchmark, we used the following
post-processing procedure to compute disparities from its 3D reconstruction output. A pixel in
the reference image is associated with its reconstructed point in the cyclopean coordinate frame.
Owing to the images being rectified, the optical axes of all cameras are parallel and therefore,
reconstructed points have the same depth Z in the coordinate frames of all cameras. Then, the
disparity d of a pixel for which a 3D point with depth Z has been reconstructed with psweep is
computed with d = f b/Z , where f is the focal length and b the stereo baseline.
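This conversion is trivial to express in code; the sketch below uses illustrative values for f and b (the paper does not list its camera parameters here):

```python
def depth_to_disparity(Z, f, b):
    """Disparity d = f*b/Z (in pixels) for rectified cameras with focal
    length f (pixels), baseline b, and reconstructed depth Z (same
    length unit as b)."""
    if Z <= 0:
        raise ValueError("depth must be positive")
    return f * b / Z

# e.g., with f = 1000 px and b = 0.1 m, a point at Z = 2 m maps to
# d = 1000 * 0.1 / 2.0 = 50 px
```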
Using Middlebury’s evaluation protocol with the disparities computed by psweep and gad for
the test images, we computed the percentages of bad matching pixels listed in Table 2. When these
performance metrics were compared against those of 167 other disparity-based stereo algorithms
evaluated in the benchmark’s web page [Scharstein and Szeliski 2015], psweep ranked 38th (with a
score of 47.6, smaller is better) and gad 96th (scoring 89.9). Thus, psweep is in the top quartile (i.e.,
25%) of the algorithms and gad in the top 60%. We remark that several of the top performing methods
on the Middlebury benchmark either involve expensive cost aggregation strategies, e.g. [Yang et al.
2014; Yoon and Kweon 2006; Zhan et al. 2016] or employ global disparity computation, e.g. [Liu
et al. 2015; Mozerov and van de Weijer 2015; Yang et al. 2009], and therefore are unsuitable for
implementation on resource-constrained FPGAs.
The average percentages of bad matching pixels over all test images were 2.43% for psweep
and 6.42% for gad. With these average bad matching pixel percentages, psweep ranks first in
accuracy with respect to the 19 stereo methods compared using the Middlebury test images in
Table IV of [Ttofis et al. 2015], while gad comes second. To the best of our knowledge, the above
constitutes the first evaluation of a space sweep algorithm on the Middlebury benchmark reported
in the literature.
4 PSWEEP HARDWARE ARCHITECTURE DESIGN
To meet the time requirements of ESA and facilitate implementation on resource-constrained HW,
we accelerated psweep by applying our custom HW/SW co-design methodology [Lentaris et al.
2016]. The latter involves algorithmic profiling/analysis, HW/SW partitioning, HW architecture
design, parametric VHDL coding, system integration with CPU-FPGA communication, as well as
design space exploration, i.e. parameter tuning. Next, we describe all of our steps and their outcome.
Table 2. Percentages of bad matching pixels for psweep and gad on the Tsukuba, Venus, Teddy and Cones
test images from the Middlebury dataset. The rightmost column comprises the average of each row.
Algorithm   Tsukuba               Venus                 Teddy                  Cones                 Bad pixels
            nonocc  all   disc    nonocc  all   disc    nonocc  all    disc    nonocc  all   disc    average %
psweep      4.29    4.95  1.41    1.21    1.83  0.31    3.67    4.43   2.51    1.38    1.97  1.21    2.43
gad         6.21    7.52  4.66    3.31    4.11  2.34    9.80    10.82  6.77    7.04    8.37  6.04    6.42
Fig. 3. Number of points reconstructed with psweep (left) and their mean error (right) as a function of distance
from the cyclopean eye for the “rocky” and “flat” datasets.
Table 3. Profiling on Intel i5-4590 and LEON3@50MHz: 1 thread, 8-bit 1120x1120 image, kT depths (iterations).

psweep function     time/iteration   utilization %   data in   memory
                    (on i5-4590)     (on LEON3)      (Mbit)    (MB)
image projection    86 msec          47.5            20        224
MNCC calc.          94 msec          51.5            20×kT     –
map updating        1.7 msec         0.96            40×kT     –
reconstruction      0.07 msec        0.04            40        4
total (per image)   55 sec           4778 sec        20        228
4.1 Algorithm Profiling & HW/SW Co-Design
The co-design methodology begins with a detailed analysis of the algorithm, which combines
automatic profilers and manual examination to partition it effectively into HW and SW components. To
accurately measure the performance of the code on a space-representative CPU, we ported the entire
psweep to a soft-core LEON3 with RTEMS running on a Xilinx FPGA at 50 MHz. In addition to
execution time and memory footprint, we also assess the communication requirements of each func-
tion, the reuse of variables among functions, as well as their arithmetic requirements (fixed/floating
point operations, dynamic range, accuracy, etc). By also considering the capabilities/peculiarities of
space-representative FPGAs, e.g., Xilinx Virtex-5QV, we compare the above results to the exact
ESA specifications/budgets and we select the functions that must be accelerated on HW.
The profiling results are summarized in Table 3, which breaks psweep into four main functions
and analyzes time, I/O, and memory usage. More specifically, assuming a total of kT = 301 swept
planes and stereo images of resolution 1120×1120, Table 3 reports the execution time on Intel Core
i5-4590 (average msec per iteration, plus total time per image), the execution time on LEON3 (in
terms of utilization per function, plus total time per image), the I/O requirements per function (total
Mbits input from the previous function), and the memory footprint. In total, psweep consumes
an excessive amount of time on LEON3, requiring well over an hour per image pair. Projecting
Fig. 4. High-level HW architecture of psweep based on deep pipelining at pixel-level & on-the-fly processing
input images on the sweeping plane consumes almost half of psweep’s time. Even more expensive
is the computation of the 13 × 13-windowed MNCC function. Hence, it becomes clear that these
two functions must be accelerated on the FPGA. Map updating is almost 10x faster than projection
and MNCC. However, its increased I/O (40 Mbit per iteration) prohibits its separation from MNCC
as their communication would increase the amount of off-FPGA data transfers, thus stalling
the HW/SW co-processing. In contrast, the reconstruction transforming the final map to world
coordinates is executed only once, after the sweep, consumes little time (2 sec on LEON3) and
requires limited I/O (just the final map data). In terms of arithmetic precision, the first three
functions can operate with fixed-point arithmetic, whereas the last function relies on floating-point
transformations. Considering all the above, we decide to accelerate the entire psweep on HW
except for the final reconstruction, which will be executed on the CPU handling the I/O with
the FPGA. In this manner, we accelerate 99.9% of psweep computations and aim at a speedup
factor of around 1000x over space-grade LEON to meet ESA’s requirements. Owing to the memory
requirements exceeding the storage capacity of FPGAs, processing should proceed in a stream
mode, with pixels acted upon and results forwarded to the host progressively and with minimal
buffering. For CPU-FPGA communication, we rely on ordinary 100 Mbps Ethernet, which involves
a HW controller (CSMA/CD LAN IEEE 802.3) and a SW driver handling packets of 1500 bytes MTU.
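The SW driver's packetization of the streamed image data can be pictured as follows (our sketch; Ethernet frame headers and checksums are ignored):

```python
def packetize(payload: bytes, mtu: int = 1500):
    """Split raw image data into MTU-sized chunks for the 100 Mbps
    Ethernet link between CPU and FPGA (sketch; the last chunk may be
    shorter than the MTU)."""
    return [payload[i:i + mtu] for i in range(0, len(payload), mtu)]
```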
4.2 Proposed HW Architecture
The high-level architecture of the proposed HW engine is depicted in the block diagram of Fig. 4.
Overall, our engine consists of a control unit and five components, which collectively form a very
deep pipeline starting at the “image memory” and terminating at the “depth map” memory. The
pipeline sustains an internal throughput of one MNCC value per cycle, i.e., it updates one depth
map value at every cycle in the course of examining each new depth plane (cf. section 3.1). To
sustain such throughput, the pipeline operates on a pixel basis and integrates dozens of processing
stages (70 − 84 depending on the design parameters, e.g., the MNCC size), which are fine-tuned at
register-transfer level to allow for a high clock frequency. As a result, the proposed architecture
achieves an increased level of parallelism (effectively, dozens of pixels are processed in parallel
within our deep pipeline), which is further increased by handling the left and right images of the
stereo pair in parallel. Moreover, aiming to minimize the on-chip memory utilization, our design
performs on-the-fly processing. In contrast to conventional SW approaches, our architecture avoids
storing any intermediate values/results other than the input pixels and the depth outputs, i.e., with
almost 100% HW utilization, the pipeline transforms the information as it flows from stage to stage
without employing temporary buffers, apart from a limited-size RAM needed by MNCC.
The central control unit initializes all components (loads the image memory, resets the registers,
etc.) and executes the main loop of psweep: at each iteration k , it assumes a new hypothetical depth
plane Dk and commands the processing modules to evaluate the current hypothesis for all pixels of
the Wi × Hi image. More specifically, assuming a sweeping plane of size Wp × Hp at depth Dk, the
“address generator” scans the stored image to facilitate the projection of each pixel from Wi × Hi
to a specific location on the Wp × Hp plane Dk. The image scanning is performed so that plane Dk
is gradually filled with one pixel per cycle in a raster-scan order. That is, depending on Dk,
the scan begins from a pre-determined location ⟨x0, y0⟩ in the image and continues according to a
pre-determined non-integer step (from left to right, top to bottom). These fixed values are stored in
a look-up table, e.g., a ROM, accessed at the beginning of each iteration k. The “address generator”
computes one pair of ⟨xn, yn⟩ vectors per cycle, n ∈ [1, Wi · Hi], one referring to the left and one to
the right image. The ⟨xn, yn⟩ vector is translated within the “storage” module to a memory address
and a refinement vector. The memory address is used to fetch a square of 4 neighboring pixels from
the image, whereas the refinement vector is used to guide the “interpolation” module to weigh the
4 fetched pixels and generate 1 projected pixel on the depth plane Dk.
Two projected pixels per cycle (one from the right plane and one from the left) are forwarded to
the “similarity” module, which evaluates the photoconsistency of the left and right planes according
to the MNCC metric. It utilizes small internal buffers to collect multiple pixels per support window
and calculate one MNCC value per pixel location (in total, Wp × Hp values per Dk). The “updating”
module receives one MNCC per cycle and compares it to the best value computed so far for the
specific location on the plane. In case of a photoconsistency improvement, the depth map is updated.
Iteration k completes when the entire plane Dk has been scanned (in Wp × Hp cycles) and the
control unit proceeds to the next plane Dk+1 by re-commanding the five modules of the pipeline.
The architecture is implemented with fixed-point arithmetic and parametric VHDL, which
allows the granularity of the sweep, i.e., the number of main loop iterations kT, to be configured
at compile time. The same holds for the granularity of the sweeping plane, Wp, Hp, the image
size, Wi, Hi, the MNCC support window size, M, and the word-lengths of the internal datapaths
(for precision tuning). Moreover, to tackle the limited on-chip memory of FPGAs, we equip our
psweep engine with an image partitioning mechanism to decompose the input data as follows.
The CPU divides the image in B horizontal bands (stripes) of size Wi × Hb, with Hb = Hi/B,
which are processed successively by reusing the same FPGA resources: each band is downloaded
to the FPGA, processed in kT iterations, and the resulting depth map stripe is uploaded to the
CPU almost independently of the next band (in practice, the bands overlap at their borders to
facilitate the correct sliding of the MNCC mask between them). Band height, Hb, is also provided
as a VHDL parameter. Overall, our HW design combines deep pipelining, on-the-fly computation,
tight synchronization between modules to maximize data flow and HW utilization, parallel memory
organization and parallelization of arithmetic to sustain the pipeline throughput regardless of
local algorithmic requirements, minimization of on-chip buffers and data transfers in CPU-FPGA
communication. Changes of setup/parameters during the rover’s lifetime can be accommodated via
FPGA reconfiguration. The following paragraphs elaborate on the architectural details of the main
modules of HW psweep.
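The band decomposition described above can be sketched as follows; the exact overlap policy at band borders is our assumption (half the MNCC mask on each side), as the text only states that bands overlap:

```python
def band_rows(Hi, B, M):
    """Row ranges [start, end) of B horizontal bands of an Hi-row image,
    each nominal band of height Hb = Hi/B extended by the half-width of
    the MxM MNCC mask so the mask can slide across band borders."""
    Hb = Hi // B
    half = M // 2
    return [(max(0, i * Hb - half), min(Hi, (i + 1) * Hb + half))
            for i in range(B)]
```

For Hi = 1120, B = 28 and M = 13, this yields 40-row nominal bands with 6-row margins at the internal borders.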
4.2.1 Image Storage and Projection. Almost one third of the on-chip memory in psweep is used
for storing the input stereo pair, i.e., the pixels of a band (updated B times per image). To support the
aforementioned pipeline throughput, the “storage” module utilizes two 4-bank parallel organizations,
one storing the left and one storing the right image. The 4 banks {A,B,C,D} are interleaved in both
x and y directions of the image (i.e., ABABAB... for the first row, CDCDCD... for the second row,
etc.), so that any 4 neighboring pixels are mapped to distinct banks. Therefore, in a single cycle, we
can access any square quadruplet from any location of the image without memory conflicts. The
translation of a ⟨xn, yn⟩ pointer on the image is performed according to the mapping

    bank(xn, yn) = (⌊yn⌋ mod 2) · 2 + (⌊xn⌋ mod 2)
    addr(xn, yn) = (⌊yn⌋ div 2) · (Wi/2) + (⌊xn⌋ div 2),          (2)

where ⌊x⌋ ≡ floor(x) denotes the integer part of x ≥ 0.
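The mapping of eq. (2) can be checked directly in software; the sketch below verifies the conflict-free property, namely that any 2×2 pixel quadruplet touches all four banks exactly once:

```python
def bank(x, y):
    # bank index of pixel (x, y) under the ABAB/CDCD interleaving
    return (y % 2) * 2 + (x % 2)

def addr(x, y, Wi):
    # address of pixel (x, y) inside its bank, for an image of width Wi
    return (y // 2) * (Wi // 2) + (x // 2)

def conflict_free(x, y):
    # the four neighbours of any bilinear fetch hit four distinct banks
    quad = [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]
    return sorted(bank(px, py) for px, py in quad) == [0, 1, 2, 3]
```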
With the use of a 4-word barrel shifter, the calculated addr(xn, yn) is forwarded to bank(xn, yn).
In parallel, addr(xn+1, yn), addr(xn, yn+1) and addr(xn+1, yn+1) are also forwarded to their
corresponding banks. The key value bank(xn, yn) is delayed via shift-registers and routed to the
output of the 4 banks to align the fetched quadruplet via a second barrel shifter according to
the ⟨xn, yn⟩ request. While the integer part of ⟨xn, yn⟩ is used in eq. (2) for fetching the pixel
quadruplet, the fractional part of ⟨xn, yn⟩ is forwarded to the “interpolation” module. The fractional
part ⟨xfr_n, yfr_n⟩ forms a refinement vector, which guides a bi-linear interpolation to generate a new
pixel in-between the fetched square of pixels (practically, xfr_n and yfr_n act as weights). This
interpolant constitutes the value of a pixel being projected from the image to a specific location on
the sweeping plane Dk (the successive ⟨xn, yn⟩ requests of the “address generator” are set up so that
the projections cover the sweeping plane in integer raster-scan order). The fetching pipeline has 8
stages and the interpolation pipeline has 12 stages (MULT/ADD arithmetic). The two pipelines are
cascaded as shown in Fig. 4 to input one ⟨xn, yn⟩ request and output one interpolant per cycle.
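Functionally, the fetch-and-interpolate path computes a standard bilinear blend; an SW equivalent is sketched below (ours, in floating point rather than the fixed-point HW datapath):

```python
import math

def project_pixel(img, xn, yn):
    """SW equivalent of the storage + interpolation modules: fetch the
    2x2 neighbourhood at (floor(xn), floor(yn)) from img[y][x] and blend
    it bilinearly with the fractional parts as weights."""
    x0, y0 = math.floor(xn), math.floor(yn)
    xf, yf = xn - x0, yn - y0            # refinement vector
    p00, p01 = img[y0][x0], img[y0][x0 + 1]
    p10, p11 = img[y0 + 1][x0], img[y0 + 1][x0 + 1]
    top = p00 * (1 - xf) + p01 * xf      # blend along x, top row
    bot = p10 * (1 - xf) + p11 * xf      # blend along x, bottom row
    return top * (1 - yf) + bot * yf     # blend along y
```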
4.2.2 Similarity Metric Calculation. The “similarity” module correlates the left and right planes
Dk on-the-fly, while they are output from the “interpolation” module in a pixel-by-pixel fashion and
in raster-scan order. The module computes 5 distinct sums per cycle and combines them according to
eq. (1). The challenge is to perform all summations by accessing each pixel once (and not repeatedly
for multiple overlapping support windows), i.e., without storing the left and right planes Dk . To
achieve this optimization, we propose a parallel architecture based on serial-to-parallel buffers and
partial sum calculators. Fig. 5 depicts the proposed architecture with 2-pixel input per cycle (one
from each plane) and 1 MNCC output. Each “buffer” consists of M FIFO RAMs, each one of depth
Wp, where M denotes the height of the M × M support window (e.g., M=13) and Wp the width of
the plane (e.g., Wp=1117). The FIFOs are serially connected to each other, such that when a new
pixel enters the first FIFO, all FIFOs update their outputs and operate as a sliding window on
the sweeping plane. Once every cycle, the window slides by one pixel in raster-scan order over
the plane, allowing a new column of M × 1 pixels to be read concurrently. The M × 1 column is
forwarded in parallel to five distinct components, which compute the five sums involved in eq. (1).
Fig. 5 depicts the structure of such a component assuming two M-word inputs, {wj} and {ej}, with
j ∈ [1, M]. Notice that in practice, depending on which sum of eq. (1) we need to compute, we
instantiate wj = 1 (for ΣPl), wj = ej (for ΣPl²), ej = wj (for ΣPr²), ej = 1 (for ΣPr), or
wj ≠ ej (for ΣPl · Pr), where Pl and Pr denote pixels from the left and right plane, respectively.
The M input pairs are routed to M distinct multipliers to derive the {wj · ej} products, which are
added by a tree structure to form a partial sum of the corresponding term used in eq. (1). In M
consecutive cycles, M partial sums are accumulated to calculate one complete term of eq. (1). We
note, however, that before feeding the aforementioned accumulator, we subtract the partial sum
computed M cycles in the past from the current value, because we must hold exactly M partial
sums in our accumulator in the course of scanning the plane and avoid indefinite increases. For this
reason, each new partial sum is also delayed by a depth-M shift-register. In parallel, we feed the
five calculated terms (i.e., complete sums) to the “fraction” component to generate the numerator
and denominator of eq. (1) (with adders/subtractors, squaring units, and flip-flops serving as local
synchronizers) and feed a pipelined fixed-point divider generating the MNCC output. Cascading
Fig. 5. Parallel architecture and HW pipelines of MNCC “similarity” module.
the above components forms a deep pipeline, consisting for example of 47 total stages for M=13
and 13-bit accuracy (23 stages in the “term calculator”, 15 in the “divider” and 9 for the rest), to sustain a
throughput of one MNCC result per cycle.
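A functional (non-pipelined) model of this one-pass scheme is sketched below. Note that eq. (1) is not reproduced in this excerpt, so the final combination of the five sums assumes the common "modified NCC" form 2·cov/(var_l + var_r); only the running add-new/subtract-old bookkeeping mirrors the HW:

```python
def sliding_mncc(L, R, M):
    """Functional model of the 'similarity' module: one pass over two
    equal-size planes L, R (2D lists), keeping the five window sums up
    to date by adding the newest Mx1 column sums and subtracting those
    that fell M columns behind. Returns {(y, x) of window top-left: mncc}."""
    H, W, N = len(L), len(L[0]), M * M
    terms = [lambda l, r: l, lambda l, r: r,
             lambda l, r: l * l, lambda l, r: r * r, lambda l, r: l * r]
    out = {}
    for y in range(H - M + 1):
        # column sum of one term over the M rows starting at y
        col = lambda f, x: sum(f(L[y + j][x], R[y + j][x]) for j in range(M))
        sums = [0.0] * 5  # running window sums of the five terms
        for x in range(W):
            for t in range(5):
                sums[t] += col(terms[t], x)          # add newest column
                if x >= M:
                    sums[t] -= col(terms[t], x - M)  # drop oldest column
            if x >= M - 1:                           # window complete
                Sl, Sr, Sll, Srr, Slr = sums
                cov = Slr - Sl * Sr / N
                var = (Sll - Sl * Sl / N) + (Srr - Sr * Sr / N)
                out[(y, x - M + 1)] = 2 * cov / var if var else 0.0
    return out
```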
4.2.3 Depth Map Updating. Each MNCC result is compared to the best MNCC computed so far
for the current ⟨x, y⟩ location on the plane over all Dk iterations. For this purpose, the “updating”
module includes a RAM of depth Wp · Hp to store the minimum required information per ⟨x, y⟩
location. First, it stores the best MNCC value Vb and its corresponding depth Db, which are updated
based on the comparison to the new MNCC value Vk at Dk. The comparison is implemented via
a read-write loop around the RAM block. That is, we employ a dual-port memory and develop a
7-stage pipeline starting and ending at the RAM block to continuously a) read the RAM address
addr, b) compare the addr contents to the current Vk, c) conditionally update the addr contents with
Vb = Vk and Db = Dk, d) increase addr by 1. We note that the local counter addr is synchronized to
the input rate of the module (we raster-scan the entire Wp × Hp plane), whereas the writing addr is
a mere delay of the read addr (the pipelined loop includes multiple registers to synchronize the data
and assist the EDA tools in routing the circuit). Second, to facilitate the improvement of psweep’s
precision via depth interpolation, i.e., to fit a parabola around the best MNCC value Vb according
to the formula (Vb−1 − Vb+1) / (Vb−1 + Vb+1 − 2Vb), the module temporarily stores all the MNCC
values Vk of the current depth plane (each one in a distinct RAM address, which is updated at every
main loop iteration k unless a Db is discovered). More specifically, when a new value Vk for ⟨x, y⟩
enters the module, a designated stage of our pipeline examines whether the previous value Dk−1
was selected as Db for that specific ⟨x, y⟩. In this case, the information currently flowing in the
module includes Vb (the stored Vk−1), Vb+1 (the current input Vk), and Vb−1 (the value Vk−2, which
was not updated at the previous iteration k − 1, because a Db was discovered). The three values
⟨Vb−1, Vb, Vb+1⟩ originating from the module’s I/O ports and internal RAM are synchronized and
routed to a secondary pipeline of f + 8 stages, where f denotes the fractional bits output from
depth interpolation (e.g., f = 20). This pipeline operates in parallel to the aforementioned 7-stage
primary pipeline and involves adders and a divider to compute the parabola fitting formula. The
secondary pipeline terminates at a distinct RAM of size Wp · Hp · f, which stores the fractional parts
of the depth map (the integer parts are indexed on the CPU via the Db results). Notice that our HW
implementation possibly computes the fractional part multiple times per ⟨x, y⟩ in the course
of psweep, as opposed to typical SW implementations, which perform only one such computation
at the end of all kT iterations. However, with the proposed scheme of performing interpolation
on-the-fly, the HW avoids storing all Vk values for all kT iterations, i.e., we store on-chip the
minimum number of Vk values (namely 2 per ⟨x, y⟩) and minimize the memory utilization without
stalling the computation as depth interpolation is performed in parallel to all other tasks. Upon
completion of all kT iterations, a local FSM forwards and clears the contents of the local 2-bank
RAM to the output of the HW psweep engine.
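In SW form, the refinement step amounts to the following (our sketch; the offset's sign convention and its scaling to physical depth are assumptions, as the text only quotes the parabola expression):

```python
def subplane_offset(v_prev, v_best, v_next):
    """Parabola-fit offset around the best MNCC value Vb, using the
    expression quoted in the text: (Vb-1 - Vb+1) / (Vb-1 + Vb+1 - 2*Vb).
    Returns 0 for a degenerate (flat) triple."""
    denom = v_prev + v_next - 2.0 * v_best
    if denom == 0.0:
        return 0.0
    return (v_prev - v_next) / denom
```

A symmetric pair of neighbours yields a zero offset, i.e., the peak sits exactly on the winning depth plane.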
4.3 Design Space Exploration & Fine-Tuning on FPGA
To assess the cost-speed-accuracy trade-offs in HW psweep and customize our engine for various
applications/devices (e.g., commercial Virtex-6, space-grade Virtex-5, European NG-MEDIUM), we
perform a design space exploration with the aid of our parametric VHDL code. The key parameters
affecting the sweep’s granularity, i.e., Wp, Hp, kT, were derived during algorithmic design to achieve
the accuracy results of section 3.3 and are retained here. For conciseness, we omit presenting a few
word-lengths in internal datapaths with minor impact on cost (e.g., the fractional bits in interpolation
affecting only a few ADD/MULT units, or the fractional bits of output depths). We select a representative set
of parameters, sufficient for our fine-tuning of psweep, and we implement numerous configurations
to evaluate the FPGA results. Specifically, we tune the following 4 parameters with trade-offs in:
• Image rows: related to the partitioning of the input image in bands, it determines the number
of image rows stored on-chip. Hence, it trades on-chip memory for comm/execution time.
• Plane rows: related to the partitioning of the output map in bands, i.e., the rows processed in
a single burst. It trades memory for execution time (decreases overhead of pipeline refilling).
• MNCC mask: the size M of the MNCC support window M×M. It trades psweep accuracy for
HW cost (M increases the number of MNCC buffers, the size of the MULT-ADD trees, and latency).
• MNCC bits: related to the fixed-point precision of MNCC, used for word-length optimization.
It trades psweep accuracy for HW resources (increases the bits involved in the calculations).
To illustrate the explored design space, we provide three distinct figures based on results obtained
with Xilinx ISE 14.7 for Virtex6 VLX240T: Fig. 6 depicts the FPGA resources for six configurations
with the parameter values shown on the x-axis as ⟨Image rows - Plane rows - MNCC mask - MNCC bits⟩ quadruples. Fig. 7 reports the total execution time of HW psweep with respect to band height, i.e.,
Plane rows, for various MNCC mask sizes, assuming a clock frequency of 172 MHz. Finally, Fig. 8
shows the accuracy of HW psweep by comparing the FPGA output to the SW results (used here
as reference, analyzed in Section 3.2) while varying the MNCC bits (i.e., the main source of HW
error, the other modules are ignored). For clarity, the y-axis of Fig. 8 shows in logarithmic scale
the number of errors, i.e., the values which are not equal to the SW results, expressed in distinct
custom units (e.g., in 500’s for depth values having more than 2 cm mean error).
Fig. 6. Resources of 6 psweep configurations denoted by ⟨Image rows - Plane rows -MNCC mask -MNCC bits⟩
Fig. 7. FPGA time per image versus band size (image = 1120x1120x2 pixels, clock = 172 MHz).
Fig. 8. psweep accuracy with respect to HW datapath
bits (differences versus SW implementation, for varying
MNCC bits) for 2.5 Mpixel stereo image (1120x1120x2).
In addition to providing a thorough evaluation of psweep, Figs. 6, 7, and 8 guide us to an
efficient balancing of parameters for the given application. Hereafter, we fix the MNCC to 13 bits,
because additional bits provide negligible improvement in accuracy at the cost of considerable
increase in RAMB/LUT, e.g., +6% if we use 18 bits; compared to the SW reference, 13 MNCC bits
lead to 99.9% of the depth results having error less than 0.25mm, only 0.01% having error more than
2 cm (i.e., 12 in a million), and 99.7% of the map holes remain intact (HW and SW implementations
reconstruct the same area in front of the rover). We note that MNCC bits have a minor effect on
speed. Instead, speed decreases considerably with the increase of the MNCC mask to Plane rows
ratio, which corresponds to the overhead of re-filling the deep pipeline and MNCC buffers each
results in approx. 33% more RAMBs with only 7% time gain. Notice that, among all FPGA resources
(LUTs, DFFs, RAMBs, DSPs), we pay particular attention to decreasing the memory due to the
limited RAM available on-chip in the FPGAs (especially in NG-MEDIUM, where the total amount is
only 2.7 Mbits, or 56 RAMB48s). Therefore, given that even the small bands respect our time budget
(except, e.g., 10 Plane rows with a 13 × 13 or 15 × 15 MNCC mask), we opt for 30 Plane rows. Overall,
we select the “44-30-13-13” configuration as the most suitable for meeting all ESA requirements
with sufficient safety margins, especially with respect to accuracy.
Table 4. HW cost analysis of the final psweep configuration.
FPGA resources on Virtex 6 (xc6vlx240t)
component            LUTs         DFFs         RAMB36s     DSPs
control unit         115 (1%)     51 (1%)      0 (0%)      0 (0%)
address generator    498 (1%)     427 (1%)     1 (1%)      0 (0%)
storage module       783 (1%)     812 (1%)     28 (6%)     0 (0%)
pixel interpolation  437 (1%)     594 (1%)     0 (0%)      8 (1%)
MNCC similarity      2,293 (1%)   2,481 (1%)   12 (2%)     45 (5%)
updating module      1,322 (1%)   1,382 (1%)   59 (14%)    0 (0%)
total                5,448 (3%)   5,747 (1%)   100 (24%)   53 (6%)
Table 5. HW comparison to relevant works in the literature (PDS=resolution·disparities·fps, eff =PDS/LUT).
publication [Lentaris et al. 2012] [Tomasi et al. 2012] [Wang et al. 2015] [Cocorullo et al. 2016] current work
cost (RAMB/LUT) 109 / 8.5K 99 / 58K ∼830 / 222K 32 / 70K 100 / 5.5K
speed (PDS) 109M 4505M 10472M 1253M 222M
HW efficiency 13K 78K 47K 18K 40K
4.4 FPGA Implementation Results & Comparison to Literature
The single-FPGA cost of the selected psweep configuration (specifically 44-30-13-13) is analyzed in
Table 4. The most demanding modules in terms of logic are “similarity” (due to the MNCC formula
complexity) and “updating” with depth interpolation, whereas the most memory demanding are
“storage” and “updating”. We note that while these utilization ratios appear small in the commercial
Virtex6, they however challenge the less capable space-grade FPGAs. The max clock frequency is
300 MHz on xc6vlx240t-2, where a 1120x1120 stereo pair of 301 depths completes in 1.7 sec.
Compared to existing works in the literature, the proposed psweep proves HW-efficient and very
low-cost (Table 5). For example, when gad [Lentaris et al. 2012] is implemented with a 13 × 13 mask
(for a fair comparison to the MNCC mask), the LUT+RAMB cost of psweep becomes smaller than gad’s
(8.5K LUTs, 109 RAMB36), with almost half the execution time (gad computes two depth maps for
bi-directional consistency checking) and better accuracy (cf. Sec. 3.3). Secondarily, gad utilizes zero
DSPs due to its less complex SAD-like similarity metric (for comparison purposes, the cost of psweep
implemented without DSPs is 12.6K LUTs). Moreover, at the algorithmic level, psweep is more
configurable compared to gad (has parametric granularity of depth and plane resolution). Compared
to the stereo core of [Wang et al. 2015], psweep provides the same HW efficiency (= throughput / logic)
when configured for fairness at 128 integer depths and 3.8 Mpix image resolution, which results
in approximately 1 FPS at 300 MHz with 41x fewer LUTs than [Wang et al. 2015] (and almost 5x
less on-chip memory). Similar conclusions for our HW efficiency are drawn when refining this
ratio to use the PDS throughput (the PDS metric combines resolution, depths, and FPS). psweep
exchanges ∼6x PDS for ∼12x fewer LUTs compared to [Cocorullo et al. 2016], whereas it utilizes
∼11x fewer LUTs and half the DSPs of [Tomasi et al. 2012] (the increased HW efficiency of
which comes at the expense of decreased reconstruction accuracy, as indicated by its performance
on the Middlebury images). This order of magnitude in LUT decrease emerges also versus the
Virtex6 implementation of [Jin and Maruyama 2014] (∼22x), the Kintex7 implementation of [Ttofis
et al. 2015] (∼18x), or even the Altera implementation of [Shan et al. 2014]. Therefore, given the
above comparisons, it becomes clear that psweep achieves state-of-the-art HW efficiency with the
lowest cost implementation (by one order of magnitude), which makes it more suitable for use on
resource-constrained, space-grade FPGAs (meets the goal/trade-off explained at the end of Sec. 2.2).
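As a sanity check on these figures, the PDS and HW-efficiency numbers can be re-derived from quantities stated in the text. The short sketch below (our illustration, not part of the paper's tool flow) reproduces the psweep column using the formulas of this section, PDS = pixels × depths × FPS and HW efficiency = PDS / LUTs:

```python
# Re-derivation of the psweep PDS and HW-efficiency figures (our arithmetic).
# Image size, depth count, execution time and LUT cost are taken from the text;
# the formulas are the ones stated in Sec. 4.4.

WIDTH, HEIGHT = 1120, 1120   # stereo pair resolution
DEPTHS = 301                 # depth hypotheses swept by psweep
EXEC_TIME = 1.7              # seconds per stereo pair on xc6vlx240t-2
LUTS = 5.5e3                 # psweep LUT cost (Table 5)

fps = 1.0 / EXEC_TIME
pds = WIDTH * HEIGHT * DEPTHS * fps   # pixel-depths per second
efficiency = pds / LUTS               # PDS per LUT

print(f"PDS        = {pds / 1e6:.0f} M")         # ~222 M, matching Table 5
print(f"efficiency = {efficiency / 1e3:.0f} K")  # ~40 K, matching Table 5
```

The same arithmetic applied to the other columns of Table 5 reproduces their HW-efficiency row as well.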
ACM Transactions on Embedded Computing Systems, Vol. 18, No. 2, Article 16. Publication date: April 2019.
16:18 Lentaris, G. et al.
[Fig. 9 flowchart boxes: Start; Partitioning Methodology; Synthesis & Implementation; “Proper operation?”; Synchronization Methodology; “Partition changes?”; End]
Fig. 9. Overview of proposed methodology for multi-FPGA implementation.
5 MULTI-FPGA IMPLEMENTATION
The last part of our work focuses on the implementation of psweep on actual space-grade technology
and, in particular, on providing working solutions with limited-size FPGA devices. We note that,
by exploiting the resource optimization of our previous section, we can now target rad-hard-by-design FPGAs, which support reliable embedded system development without additional mitigation
techniques, e.g., Triple Modular Redundancy. Despite the existence of a relatively big rad-hard
FPGA, namely the Xilinx Virtex-5QV, certain situations mandate the use of even smaller devices,
such as the Virtex-4QV (e.g., due to heritage issues), the Microsemi ProASIC3 (e.g., due to low-power
constraints), or the latest NanoXplore NG-MEDIUM (e.g., for reliance on European HW). However,
these devices provide a very limited amount of resources and cannot support the entire psweep,
especially when the embedded system/application is expected to include additional HW functions.
When considering limited-size devices, we propose the custom methodology for multi-FPGA im-
plementation shown in Fig. 9. It consists of two inter-dependent parts: the Partitioning, which refers
to the sequence of steps required to partition the design across multiple devices, and the Synchronization, which gradually establishes correct communication/synchronization among the devices.
5.1 Partitioning Methodology
Similar to HW/SW partitioning, the HW/HW partitioning involves analysis of the algorithm/design,
exploration of the HW platform, and careful decisions. We identify three steps:
(1) Analysis of the complexity per design component: perform fine-grain estimation of
resources (LUTs, RAMBs, DSPs and I/Os) for each component/function to facilitate informed
decision making throughout the entire process.
(2) Partitioning and Mapping: based on the analysis, we manually select the best mapping of
components to devices. The goal is to perform a balanced partitioning of the design in terms
of resources. Furthermore, we seek to minimize the number of interconnections (cut size)
between the partitions/devices while respecting pin constraints (e.g., number), in order to
simplify the integration of the system and decrease power dissipation (less I/O pin utilization).
(3) Trace assignment: we determine the interconnections between the devices, i.e., specific
pins and board traces, by considering the characteristics of the underlying platform (e.g.,
PCB architecture) and the available cables/connectors to support the connectivity of the
FPGAs. We manually seek to employ traces/cables with matching characteristics (e.g., length,
resistance) to decrease transfer delay variations in the network and avoid de-synchronization.
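Step (2) can be illustrated with a toy exhaustive search: map components to two devices so that each device's resource budget is respected while the cut size is minimized. All component names, resource figures, wire widths and budgets below are invented for the example; the actual partitioning in the paper was selected manually with the tool flow of Sec. 5.3:

```python
# Illustrative sketch of balanced 2-way partitioning under resource budgets
# with minimal cut size (bits crossing between devices). All numbers are
# hypothetical; this is not the authors' tool flow.

from itertools import combinations

components = {            # name: (LUTs, RAMB18s) -- made-up figures
    "storage":    (500, 109),
    "similarity": (3000, 40),
    "updating":   (2500, 60),
    "control":    (400, 10),
}
wires = {                 # (a, b): bus width in bits between components
    ("storage", "similarity"): 32,
    ("similarity", "updating"): 24,
    ("updating", "storage"): 16,
    ("control", "similarity"): 4,
    ("control", "updating"): 4,
}
BUDGET = (4000, 180)      # per-device (LUTs, RAMB18s) budget

def cut_size(group_a):
    """Total bits crossing between group_a and its complement."""
    return sum(w for (a, b), w in wires.items()
               if (a in group_a) != (b in group_a))

def fits(group):
    luts = sum(components[c][0] for c in group)
    rams = sum(components[c][1] for c in group)
    return luts <= BUDGET[0] and rams <= BUDGET[1]

names = list(components)
best = min((set(g) for r in range(1, len(names))
            for g in combinations(names, r)
            if fits(set(g)) and fits(set(names) - set(g))),
           key=cut_size)
print(sorted(best), "| cut =", cut_size(best), "bits")
```

For realistically sized designs the search space is too large for brute force, which is one reason the analysis of step (1) is needed to guide manual decisions.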
5.2 Synchronization Methodology
When targeting high-performance multi-FPGA implementations instead of mere ASIC prototyping,
synchronizing the high-speed signals among the devices becomes very challenging. Most often, the
default implementation of the EDA tools becomes non-functional at high clock rates. To overcome
problems such as bus skew, ringing effects, jitter, etc., we examine the factors that increase suscepti-
bility to de-synchronization and develop our proposed guidelines accordingly. Prior to enumerating
the methodology steps, we discuss various issues encountered during our implementation testing.
Distribution of the clock. The choice is between system-synchronous and source-synchronous
distribution. In a system-synchronous design, the clock is provided directly from the PLL oscillator
of the board to all devices simultaneously. In a source-synchronous design, the clock is provided by
the transmitting to the receiving device, i.e., the clock signal travels alongside the data suffering
from similar effects/delays on the board and, thus, limiting the propagation variability between
clock and data. Our experiments with both schemes favored the system-synchronous distribution,
as it proved more reliable on the employed multi-FPGA platform (Synopsys HAPS [Synopsys 2017]).
Digital clock managers. DCMs (also named MMCM) are used to implement various functions
such as delay locked loop (DLL), digital frequency synthesis (DFS) and digital phase shifting (DPS).
One of the main benefits exploited in this work was the de-skewing of the input clock (DLL) for
aligning the internal clock of the FPGA with the external (incoming) clock. Furthermore, DCMs
can be used to address the bus skew problem occurring when multiple data signals travel between
the devices (e.g., violation in setup times due to variation of the transfer delays); by shifting the
clock phase between the devices via the DCM, the rising-edge position can be adjusted so that the
arriving data signals can be sampled without timing violations. However, phase shifting is effective
only when the variation of delays is relatively small compared to the clock period.
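This limitation can be made concrete with back-of-the-envelope timing arithmetic (our illustration, not from the paper; the setup/hold figures are assumed, while the trace and cable delays are representative of those quoted in Sec. 5.3):

```python
# When can DCM phase shifting fix bus skew? Shifting the sampling edge helps
# only if the spread of trace delays plus the setup/hold window fits inside
# one clock period. Setup/hold values below are assumed for the example.

def phase_shift_works(period_ns, delay_min_ns, delay_max_ns,
                      t_setup_ns, t_hold_ns):
    """True if some sampling-edge position satisfies setup and hold
    for every trace delay in [delay_min, delay_max]."""
    # Data-valid window at the receiver, common to all traces:
    window = period_ns - (delay_max_ns - delay_min_ns)
    return window > t_setup_ns + t_hold_ns

CLK = 1000 / 224   # 224 MHz -> ~4.46 ns period

print(phase_shift_works(CLK, 1.2, 2.8, 0.5, 0.3))  # 1.6 ns spread: fixable
print(phase_shift_works(CLK, 1.2, 5.2, 0.5, 0.3))  # ~4 ns spread: not fixable
```

In the second case no phase-shift value works, which is why trace reassignment (below) must homogenize the delays before phase shifting can help.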
Trace reassignment. The communication among multiple FPGAs can be realized with various
types of board traces and/or support cables (I/O extension), which are described by different elec-
trical characteristics, such as wire resistance, length, etc. Combining different type of traces/cables
introduces de-synchronization. Instead, we must match the selected traces towards homogeneous
transfer delays of signals.
Packing of registers to I/O blocks. De-synchronization also occurs due to bus skew internally to
the FPGA, i.e., due to the diverse lengths of the interconnection nets (I/O nets). To overcome this
problem, we pack the registers in the I/O blocks. Assuming sufficient available registers in the
design, automatic I/O packing leads to nets of almost equal length and balanced propagation delays.
Insertion of extra I/O registers (delay lines). Provided that the extra registers do not violate the
cycle-level timing schedule of the digital design (i.e., do not de-synchronize it), this technique is
useful for two reasons. First, when the design utilizes only a limited number of registers, the extra
registers enable the I/O packing described above. Second, in case of general timing violations, the
extra registers in the I/O paths help the placement and routing (PAR) tools meet the plethora of
distinct requirements of each net.
Specification of the I/O Standard. The I/O Standard (e.g. LVCMOS, LVDCI, LVTTL, HSTL) defines
the electrical behavior of the input receivers and the output drivers of the FPGA. It determines
specific characteristics, such as the output drive voltage (e.g., 1.2V, 1.8V, 2.5V, 3.3V), the slew rate
(e.g., fast, slow) and the drive strength (e.g., 6, 8, or 12 mA). Carefully exploring this parameter and selecting
same standards among devices is a crucial customization with respect to the board’s capabilities.
Considering all the above, we devised the synchronization methodology depicted in Fig. 10. We
present a sequence of steps, each triggered when the previous ones fail to achieve synchronization.
The first and last steps/boxes essentially refer to the partitioning of Section 5.1 (connecting the two
methodologies of Fig. 9). The final step implies that all other techniques have failed and a refinement
of the partitioning is required, most probably at the expense of resource balancing. For such refinement,
we usually start from the previous partitioning and move the cut(s) of the computational
graph back and forth until, e.g., we decrease the cut size to obtain fewer traces and/or less traffic
between the nodes/devices that could not be synchronized (making their communication less demanding).
[Fig. 10 flowchart boxes: Partitioning Methodology (initial implementation); Examine I/O Standards, Board Traces, Clock Distribution; Insert DCMs; Customize Clock Phases; Pack Registers to IOBs; Insert Extra Registers; Customize Clock Phases; Explore I/O Standards, Board Traces, Clock Distribution; Partitioning Methodology (optimized, e.g., w.r.t. cut-size)]
Fig. 10. Proposed multi-FPGA synchronization methodology.
Overall, our proposed synchronization steps are summarized as follows (in Fig. 10, we highlight in
orange the steps giving rise to an increase of FPGA resources):
(1) Initial Partitioning: We begin with the solution provided by third-party tools (e.g., Synopsys),
which includes automatic timing constraints for realizing the communication among devices.
(2) Examine I/O Standard, Board Traces and Clock Distribution: Repetitively, at tool level, we
alternate among the most promising I/O signaling combinations depending on the capabilities
of the underlying multi-FPGA board.
(3) Insert DCMs: At HDL level, we place DCMs in all FPGAs to remove the clock delay/skew.
(4) Customize Clock Phases: At tool level, we perform clock phase shifting between the devices
via the DCMs, e.g., at 90°, 180°, or 270°, for the reasons explained above.
(5) Pack Registers to I/Os: At tool level, we select the packing of registers in the I/O blocks
aiming to internally minimize and balance the FPGA I/O net delays.
(6) Insert Extra Registers: At HDL level, by respecting the digital operation schedule (cycle-wise),
we insert registers at the I/O of partitioned components to facilitate further register packing.
(7) Explore I/O Standard, Board Traces and Clock Distribution: Similar to step 2, but performed
thoroughly to explore all possible combinations and board performances.
(8) Refine Partitioning: We manually derive a new partitioning towards minimizing the commu-
nication load between problematic devices/nodes (traces and/or transactions).
In conclusion, the best approach to multi-FPGA design was the “semi-automatic” one. When using
the default output of the “fully-automatic” tools, we usually derived non-functional implementations,
especially at high clock rates. With the “manual” approach, i.e., partitioning and assignment done
entirely by hand, the effort increased disproportionately and the design became prone to bugs (e.g.,
due to programming with codewords taken from datasheets). Instead, the “semi-automatic” approach
combines the best of both worlds, i.e., the GUI automation with human control/guidance.
5.3 Experimental Setup
To validate our methodology and demonstrate the relevant multi-FPGA solutions for psweep, we
set up a tool-flow, a HW platform, and a limited-size space-grade device. As a proof-of-concept,
we assume the latest NanoXplore NG-MEDIUM rad-hard device [Le Mauff 2018] and we use the
HAPS-54 prototyping platform [Synopsys 2017]. The European NG-MEDIUM device is built on
STM 65nm rad-hard technology and provides 34K LUT4, 32K DFF, 112 DSPs, with 56 RAMB48 (2.7
Mbits, not sufficient for psweep). The Synopsys HAPS-54 multi-FPGA board (Fig. 12, left) consists
of four Xilinx Virtex-5 LX330 FPGAs (built on 65nm technology like NG-MEDIUM, with 207K
LUT6, 207K DFF, 192 DSP, 288 RAMB36) and a number of predefined and flexible interconnections.
The limited number of on-board fabricated traces, e.g., 354 fast, 110 slow, or 238 global traces, have varying electrical characteristics and transmission delays, e.g., 1.2 ns or 2.8 ns. Hence, they
Fig. 11. Multi-FPGA implementation tool flow.
form a suitable testbed for evaluating our synchronization techniques. Also, we can employ various
external plug-in connectors/cables, e.g., con_2x1 or con_cable with ∼4ns delay. Overall, we emulate
the limited-size device challenge on HAPS-54 by constraining the VLX330 utilization according to
the NG-MEDIUM specifications.
Regarding the EDA tools, Fig. 11 illustrates the entire tool flow and the required inputs. The
inputs refer to device-specific files for the instantiation of various IPs (e.g., memories, DCMs),
source code files (VHDL, Verilog), constraint files for all the FPGAs (timing and placement) and
dedicated board files that define various parameters of the employed board (e.g., HAPS-54) such as
the reset and clock distribution. In the first stage of the tool chain, we use the Synopsys Certify
tool to perform the partitioning of the design and the trace assignment to interconnect the FPGAs.
In the second stage, we use the Synopsys Synplify tool to separately synthesize for each device
every partition derived from Certify and generate the corresponding netlist file. Moreover, in this
stage, we manually perform optional customizations in the design according to the aforementioned
methodology (insert registers, DCMs, etc.). Finally, Xilinx ISE is required to perform place & route
(PAR) for each FPGA and generate the corresponding bitstream. The necessary inputs are the
synthesized netlist and a file containing the placement/timing constraints retrieved from Synplify.
5.4 Implementation Results
To put things in perspective, psweep was first implemented on a single rad-hard FPGA, i.e., on
Xilinx Virtex-5QV (total size 82K LUTs, 320 DSPs, 298 RAMB36) with ISE 14.7. Our final psweep
engine fits in a single device together with a 100 Mbps Ethernet controller, which includes local
buffers for pipelining the FPGA processing and CPU communication steps instead of stalling our
HW engine. Specifically, psweep consumes 5.9K LUTs, 53 DSPs, and 108 RAMB36, whereas the
Ethernet arbiter utilizes 3K LUTs and 55 RAMB36. The maximum clock frequency reported by
Xilinx Timing Analyzer is 166 MHz, i.e., the 2.5 Mpixel stereo image can be processed in 3.1 sec.
Based on these results and extrapolations that also regard the European FPGAs [Le Mauff 2018],
we estimate that, in such space-grade chips, psweep will consume power in the range of 4–10 watts,
which is today considered an acceptable budget for rover applications.
Next, we tested a baseline implementation of psweep on HAPS-54 by utilizing a single VLX330
device. The communication between the host PC and the HAPS-54 board was realized via Ethernet.
That is, we integrated the aforementioned custom Ethernet arbiter and we employed a HAPS
daughter-board, namely GEPHY, to realize an 81 Mbps Ethernet link (actual measured bandwidth).
The FPGA resource utilization is 9346 LUT6 (5%), 8346 DFF (4%), 312 RAMB18 (54%) and 53 DSP
(28%). In practice, by using the on-board clock switches, the maximum clock frequency increased
up to 280 MHz. The correctness of the HW results was validated by comparing to the VHDL
simulator’s results (test vectors). Hence, the functionality analysis performed in sections 3.3 and
4.3 also apply to our multi-FPGA implementations. Notice that the memory utilization exceeds the
capacity of NG-MEDIUM, and therefore, we proceed to our multi-FPGA approach.
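The memory-capacity argument can be verified with simple arithmetic (our sketch; decimal Kbit is used for consistency with the 2.7 Mbit figure quoted for NG-MEDIUM in Sec. 5.3):

```python
# Quick feasibility check (our arithmetic, using figures quoted in the text):
# the single-device psweep build needs more on-chip RAM than one NG-MEDIUM
# provides, which is what forces the multi-FPGA partitioning.

KBIT = 1000   # decimal Kbit, matching the quoted ~2.7 Mbit total

used_ramb18 = 312                      # baseline VLX330 build (Sec. 5.4)
used_bits = used_ramb18 * 18 * KBIT    # each RAMB18 holds 18 Kbit

ng_medium_ramb48 = 56                  # NG-MEDIUM block RAMs (Sec. 5.3)
ng_medium_bits = ng_medium_ramb48 * 48 * KBIT   # ~2.7 Mbit total

print(f"needed: {used_bits / 1e6:.1f} Mbit, "
      f"available: {ng_medium_bits / 1e6:.1f} Mbit")
print("fits on one NG-MEDIUM:", used_bits <= ng_medium_bits)
```

The roughly 2x memory deficit explains why only partitionings that split the RAM-heavy modules across devices can become feasible.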
Fig. 12. The HAPS-54 test platform, with 4 XC5VLX330 FPGAs and daughter-boards used for multi-FPGA
evaluation (left). Four distinct partitioning schemes utilizing 2–4 FPGAs on the HAPS-54 platform (right).
Table 6. Resources of 2- and 3-FPGA implementations, with interconnection and MHz measured on-board.

Resources          |    Double-FPGA Design   |           Triple-FPGA Design
(XC5VLX330)        |  Device A  |  Device B  |  Device A  |  Device B  |  Device C
LUTs               |  2913 (1%) |  6068 (3%) |  2911 (1%) |  3052 (1%) |  2851 (1%)
DFFs               |  2632 (1%) |  6170 (3%) |  2680 (1%) |  3125 (2%) |  2665 (1%)
RAMB18s            |  109 (19%) |  203 (35%) |  108 (19%) |  174 (30%) |   27 (5%)
DSPs               |    0 (0%)  |   55 (29%) |    0 (0%)  |   10 (5%)  |   45 (23%)
Interconnections   |         59 bits         |                82 bits
Max. Frequency     |         280 MHz         |                224 MHz
Intercon. Activity |        0.08 Gbps        |                 5 Gbps
To assess our methodology, we assumed multiple partitioning scenarios and tested all result-
ing implementations. In all cases, we paid particular attention to the selection of the routing
resources for realizing the inter-FPGA network and we concluded that increasing the number of
traces/connections increases the possibility of de-synchronization (either from the start, or during
the course of processing the images). Among all successful implementations, Fig. 12, right, illustrates
4 representative partitioning topologies utilizing two, three and four FPGAs. The blue-colored
buses indicate the I/O data transfers (e.g., pixels), while the black-colored arrows correspond to
the control signals. Tables 6 and 7 report the VLX330 utilization, together with the total intercon-
nection traces of each network, the maximum clock rate measured on the board, and the total
Table 7. Resources of 4-FPGA implementations in comm-isolated and ring topologies.

Resources       |              Quadruple-FPGA Design            |            Ring Quadruple-FPGA Design
(XC5VLX330)     | Device A | Device B | Device C | Device D     | Device A | Device B | Device C | Device D
LUTs            | 2275 (1%)| 2907 (1%)| 1588 (1%)| 1329 (1%)    | 2818 (1%)| 1741 (1%)| 3482 (2%)| 1224 (1%)
DFFs            | 2560 (1%)| 3124 (2%)| 1558 (1%)| 1122 (1%)    | 2687 (1%)| 1832 (1%)| 2665 (1%)| 1237 (1%)
RAMB18s         | 109 (19%)| 176 (31%)|  24 (4%) |    0 (0%)    | 109 (19%)|  58 (10%)|  26 (5%) | 117 (20%)
DSPs            |   0 (0%) |  10 (5%) |  45 (23%)|    0 (0%)    |   0 (0%) |  10 (5%) |  45 (23%)|   0 (0%)
Interconnect.   |                  187 bits                     |                  93 bits
Max. Freq.      |                  172 MHz                      |                  224 MHz
Inter. Activity |                  22 Gbps                      |                  5.1 Gbps
information transferred among devices (interconnection activity). We note that a rough estimation
of the NG-MEDIUM utilization can be made via simple comparison to its available resources (34K
LUT4, 32K DFF, 112 DSPs, 56 RAMB48 with 2.7 Mbits).
The most balanced partitioning was achieved with the quadruple implementation in a “ring”
topology (Table 7). To achieve high clock rates in this ring, i.e., 224 MHz, we avoided common
broadcasting lines for control signals by using duplicates, and synchronized the 93 traces to sustain
a 5.1 Gbps flow on the board. This figure is far better than the excessive 22 Gbps of the first quadruple
topology (Table 7). Due to its pipelined operation, the ring’s B-C-D devices work in parallel and
sustain 99% time utilization (engaged in processing). The 3-FPGA design is also very competitive,
with very regular interconnections (Table 6). Still, it cannot fit in NG-MEDIUM due to insufficient
memory, a frequent bottleneck for image processing on FPGAs. When considering NG-MEDIUM,
the only feasible scenario for psweep is the 4-FPGA in ring topology, with a maximum utilization
of 78% RAM (in device D, for the map), 10% LUT (in C, for MNCC), and 40% DSP (in C).
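The quoted utilization percentages can be re-derived from Table 7 (our arithmetic; the LUT figure is not re-derived here because the LUT6-to-LUT4 conversion is technology-dependent):

```python
# Sanity check of the quoted NG-MEDIUM utilization of the ring partitioning,
# using the Table 7 figures (our arithmetic, not from the paper).

KBIT = 1024
NG_RAM_BITS = 56 * 48 * KBIT   # 56 RAMB48 of 48 Kbit each -> ~2.7 Mbit
NG_DSPS = 112

device_d_ram_bits = 117 * 18 * KBIT   # device D: 117 RAMB18 (Table 7, ring)
device_c_dsps = 45                    # device C: 45 DSPs (Table 7, ring)

ram_util = device_d_ram_bits / NG_RAM_BITS
dsp_util = device_c_dsps / NG_DSPS

print(f"RAM utilization on device D: {ram_util:.0%}")   # 78%, as in the text
print(f"DSP utilization on device C: {dsp_util:.0%}")   # 40%, as in the text
```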
A fair, comprehensive comparison to previous works is infeasible due to the scarcity of
similar multi-FPGA publications. Nevertheless, compared to [Choi and Rutenbar 2016], which also
uses 4 VLX330 for stereo vision, our methodology allowed a much more fine-grained partitioning
of the design and higher clock rates instead of simple function replication for mere frame-level
parallelization. Similarly, compared to [Lee et al. 2013] that uses HAPS-64 for ray-tracing, our
methodology has led to better inter-FPGA communication by respecting pin constraints and
avoiding extra HW resources (e.g., AXI-AHB bridges to interface the FPGAs), even though we achieve
up to 22 Gbps transferred among the chips. Overall, due to its demonstrated optimizations,
our multi-FPGA methodology can provide significant acceleration of stereo processing, even when
project-specific constraints mandate the use of small devices.
6 CONCLUSION
In an effort to enhance the embedded stereo vision capabilities of future planetary rovers, this
paper has presented the development and hardware acceleration of psweep, a plane sweep variant.
psweep was experimentally demonstrated to accurately reconstruct in 3D various Mars-like scenes
with mean error less than 2 cm at 4 m depth (and less than 8 mm if large errors are excluded).
Based on pixel-level pipelining, parallel architecture design, on-the-fly processing and fine-tuning
via parametric VHDL, our FPGA psweep maintains the accuracy of the SW implementation while
processing a 2.5 Mpixel stereo image in only 1.7 sec. With a cost of only 5.4K LUT and 100 RAMB36
on a xc6vlx240t-2, this FPGA achieves speed-up factors of 32x compared to a desktop CPU (Intel
core i5-4590) and 2810x against a space-grade processor (LEON3 at 50MHz). When considering
space-grade technology, our HW minimization techniques allow psweep plus an Ethernet controller
to fit in a single Virtex5QV and process a 2.5 Mpixel image in 3.1 sec. Furthermore, to leverage the
use of limited-size space FPGAs, we devised a custom methodology for multi-FPGA partitioning
and demonstrated various implementations on sets of 2−4 example devices, e.g., NG-MEDIUM
emulated on the HAPS-54 multi-FPGA platform. Our final HW/SW embedded system meets all
the demanding requirements set by ESA in terms of accuracy, speed and HW cost. Compared to
similar published works, our FPGA solution proves HW-efficient and very low-cost, hence suitable
for supporting autonomous rover navigation in planetary exploration scenarios.
ACKNOWLEDGMENTS
The authors thank Marcos Avilés Rodrigálvarez from GMV, Spain for porting the C code on LEON3,
as well as Gianfranco Visentin from ESTEC/ESA, the Netherlands for useful discussions. This work
was supported by the European Space Agency via the SEXTANT and COMPASS projects of the
ETP-MREP research programme (ESTEC refs. 4000103357/11/NL/EK and 4000111213/14/NL/PA).
REFERENCES
Kristian Ambrosch and Wilfried Kubinger. 2010. Accurate Hardware-Based Stereo Vision. Computer Vision and Image Understanding 114, 11 (2010), 1303–1316.
Max Bajracharya, Mark W. Maimone, and Daniel Helmick. 2008. Autonomy for Mars Rovers: Past, Present, and Future. Computer 41, 12 (Dec. 2008), 44–50.
Christian Banz et al. 2010. Real-Time Stereo Vision System Using Semi-Global Matching Disparity Estimation: Architecture and FPGA-Implementation. In Int’l Conf. on Embedded Comp. Sys.: Architectures, Modeling & Simulation. 93–101.
Jungwook Choi and Rob A. Rutenbar. 2016. Video-Rate Stereo Matching Using Markov Random Field TRW-S Inference on a Hybrid CPU+FPGA Computing Platform. IEEE Trans. Circuits Syst. Video Technol. 26, 2 (2016), 385–398.
Giuseppe Cocorullo, Pasquale Corsonello, Fabio Frustaci, and Stefania Perri. 2016. An Efficient Hardware-Oriented Stereo Matching Algorithm. Microprocessors and Microsystems 46 (2016), 21–33.
Robert T. Collins. 1996. A Space-Sweep Approach to True Multi-Image Matching. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’96). 358–363.
Ahmad Darabiha, W. James MacLean, and Jonathan Rose. 2006. Reconfigurable Hardware Implementation of a Phase-Correlation Stereo Algorithm. Machine Vision and Applications 17, 2 (May 2006), 116–132.
Alex Ellery. 2015. Planetary Rovers: Robotic Exploration of the Solar System. Springer Berlin Heidelberg.
David Gallup, Jan-Michael Frahm, Philippos Mordohai, Qingxiong Yang, and Marc Pollefeys. 2007. Real-Time Plane-Sweeping Stereo with Multiple Sweeping Directions. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’07). 1–8.
David Gallup, Jan-Michael Frahm, Philippos Mordohai, and Marc Pollefeys. 2008. Variable Baseline/Resolution Stereo. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’08). 1–8.
Pierre Greisen, Simon Heinzle, Markus Gross, and Andreas P. Burg. 2011. An FPGA-Based Processing Pipeline for High-Definition Stereo Video. EURASIP Journal on Image and Video Processing 2011, 1 (2011), 18.
Heiko Hirschmüller and Daniel Scharstein. 2009. Evaluation of Stereo Matching Costs on Images with Radiometric Differences. IEEE Trans. Pattern Anal. Mach. Intell. 31, 9 (2009), 1582–1599.
Minxi Jin and Tsutomu Maruyama. 2014. Fast and Accurate Stereo Vision System on FPGA. ACM Trans. Reconfigurable Technol. Syst. 7, 1 (Feb. 2014), 1–24.
Seunghun Jin et al. 2010. FPGA Design and Implementation of a Real-Time Stereo Vision System. IEEE Trans. Circuits Syst. Video Technol. 20, 1 (Jan. 2010), 15–26.
Ioannis Kostavelis et al. 2014. SPARTAN: Developing a Vision System for Future Autonomous Space Exploration Robots. J. Field Robotics 31, 1 (2014), 107–140.
Joel Le Mauff. 2018. From eFPGA cores to RHBD System-On-Chip FPGA (NanoXplore’s presentation of the NG-MEDIUM rad-hard FPGA). https://indico.esa.int/event/232/contributions/2137/attachments/1820/2121/2018-04_NX-From_eFPGA_cores_to_RHBH_SoC_FPGAs-JLM-v2.pdf. (2018). 4th SEFUW workshop, ESTEC/ESA, Noordwijk, NL, 9 April 2018.
Jaedon Lee, Youngsam Shin, Won-Jong Lee, Soojung Ryu, and Jeongwook Kim. 2013. Real-time Ray Tracing on Coarse-Grained Reconfigurable Processor. In IEEE Int’l Conf. on Field-Programmable Technology (FPT). 192–197.
George Lentaris, Dionysios Diamantopoulos, Kostas Siozios, Dimitrios Soudris, and Marcos Avilés Rodrigálvarez. 2012. Hardware Implementation of Stereo Correspondence Algorithm for the ExoMars Mission. In IEEE Int’l Conf. on Field-Programmable Logic and Applications (FPL). 667–670.
George Lentaris, Konstantinos Maragos, Ioannis Stratakos, Lazaros Papadopoulos, Odysseas Papanikolaou, Dimitrios Soudris, Manolis Lourakis, Xenophon Zabulis, David Gonzalez-Arjona, and Gianluca Furano. 2018. High Performance Embedded Computing in Space: Evaluation of Platforms for Vision-Based Navigation. J. Aerosp. Inf. Syst. 15, 4 (April 2018), 178–192.
George Lentaris, Ioannis Stamoulias, Dimitrios Soudris, and Manolis Lourakis. 2016. HW/SW Co-design and FPGA Acceleration of Visual Odometry Algorithms for Rover Navigation on Mars. IEEE Trans. Circuits Syst. Video Technol. 26, 8 (Aug. 2016), 1563–1577.
Jing Liu, Chunpeng Li, Feng Mei, and Zhaoqi Wang. 2015. 3D Entity-Based Stereo Matching With Ground Control Points and Joint Second-Order Smoothness Prior. The Visual Computer 31, 9 (Sept. 2015), 1253–1269.
Manolis Lourakis and Xenophon Zabulis. 2013. Accurate Scale Factor Estimation in 3D Reconstruction. In Intl. Conf. on Computer Analysis of Images and Patterns (CAIP). Springer Berlin Heidelberg, 498–506.
Mark W. Maimone, P. Chris Leger, and Jeffrey J. Biesiadecki. 2007. Overview of the Mars Exploration Rovers’ Autonomous Mobility and Vision Capabilities. In Int’l Conf. on Robot. Autom. (ICRA), Space robotics workshop.
Larry Matthies et al. 2007. Computer Vision on Mars. Int. J. Comput. Vision 75, 1 (2007), 67–92.
Richard H. Maurer, Martin E. Fraeman, Mark N. Martin, and David R. Roth. 2008. Harsh Environments: Space Radiation Environment, Effects, and Mitigation. Johns Hopkins APL Technical Digest 28, 1 (2008), 17–29.
Hans P. Moravec. 1977. Towards Automatic Visual Obstacle Avoidance. In Proc. Int’l Joint Conf. on AI (IJCAI). 584–594.
Mikhail G. Mozerov and Joost van de Weijer. 2015. Accurate Stereo Matching by Two-Step Energy Minimization. IEEE Trans. Image Process. 24, 3 (March 2015), 1153–1163.
Kyprianos Papadimitriou, Sotiris Thomas, and Apostolos Dollas. 2013. An FPGA-Based Real-Time System for 3D Stereo Matching, Combining Absolute Differences and Census with Aggregation and Belief Propagation. In IFIP/IEEE Int’l Conf. on VLSI-SoC. Springer, 168–187.
Madaín Pérez-Patricio and Abiel Aguilar-González. 2015. FPGA Implementation of an Efficient Similarity-Based Adaptive Window Algorithm for Real-time Stereo Matching. Journal of Real-Time Image Processing (Sept. 2015).
Marc Pollefeys et al. 2008. Detailed Real-Time Urban 3D Reconstruction from Video. Int. J. Comput. Vision 78, 2-3 (July 2008), 143–167.
Marc Pollefeys and Sudipta Sinha. 2004. Iso-Disparity Surfaces for General Stereo Configurations. In Proc. Europ. Conf. on Computer Vision (ECCV), Vol. III. Springer, 509–520.
Kyle Rupnow, Yun Liang, Yinan Li, Dongbo Min, Minh Do, and Deming Chen. 2011. High level synthesis of stereo matching: Productivity, performance, and software constraints. In Field-Programmable Technology (FPT), 2011 Int’l Conf. IEEE, 1–8.
Daniel Scharstein and Richard Szeliski. 2002. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int. J. Comput. Vision 47, 1-3 (April 2002), 7–42.
Daniel Scharstein and Richard Szeliski. 2015. Middlebury Stereo Evaluation - Version 2. http://vision.middlebury.edu/stereo/eval/. (2015). Accessed 2017-11-21.
Thomas Schöps, Torsten Sattler, Christian Häne, and Marc Pollefeys. 2017. Large-Scale Outdoor 3D Reconstruction on a Mobile Device. Computer Vision and Image Understanding 157 (2017), 151–166.
Yi Shan et al. 2014. Hardware Acceleration for an Accurate Stereo Vision System Using Mini-Census Adaptive Support Region. ACM Trans. Embed. Comput. Syst. 13, 4 (April 2014), 132:1–132:24.
Synopsys. 2017. HAPS Family of Physical Prototyping Solutions. https://www.synopsys.com/verification/prototyping/haps.html. (2017). Accessed 2017-07-10.
Matteo Tomasi, Mauricio Vanegas, Francisco Barranco, Javier Diaz, and Eduardo Ros. 2012. Real-Time Architecture for a Robust Multi-Scale Stereo Engine on FPGA. IEEE Trans. on VLSI Syst. 20, 12 (Dec. 2012), 2208–2219.
Christos Ttofis, Christos Kyrkou, and Theocharis Theocharides. 2015. A Hardware-Efficient Architecture for Accurate Real-Time Disparity Map Estimation. ACM Trans. Embed. Comput. Syst. 14, 2 (Feb. 2015), 36:1–36:26.
George Vogiatzis, Carlos Hernández Esteban, Philip H.S. Torr, and Roberto Cipolla. 2007. Multiview Stereo via Volumetric Graph-Cuts and Occlusion Robust Photo-Consistency. IEEE Trans. Pattern Anal. Mach. Intell. 29, 12 (Dec. 2007), 2241–2246.
Wenqiang Wang, Jing Yan, Ningyi Xu, Yu Wang, and Feng-Hsiung Hsu. 2015. Real-Time High-Quality Stereo Vision System in FPGA. IEEE Trans. Circuits Syst. Video Technol. 25, 10 (Oct. 2015), 1696–1708.
Qingxiong Yang et al. 2009. Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation, and Occlusion Handling. IEEE Trans. Pattern Anal. Mach. Intell. 31, 3 (March 2009), 492–504.
Qingqing Yang, Pan Ji, Dongxiao Li, Shaojun Yao, and Ming Zhang. 2014. Fast Stereo Matching Using Adaptive Guided Filtering. Image Vision Comput. 32, 3 (March 2014), 202–211.
Ruigang Yang and Marc Pollefeys. 2003. Multi-Resolution Real-Time Stereo on Commodity Graphics Hardware. In Proc. Conf. on Computer Vision and Pattern Recognition (CVPR’03). 211–217.
Kuk-Jin Yoon and In So Kweon. 2006. Adaptive Support-Weight Approach for Correspondence Search. IEEE Trans. Pattern Anal. Mach. Intell. 28, 4 (April 2006), 650–656.
Xenophon Zabulis, Georgios Kordelas, Karsten Müller, and Aljoscha Smolic. 2006. Increasing the Accuracy of the Space-Sweeping Approach to Stereo Reconstruction, Using Spherical Backprojection Surfaces. In Proc. Int’l Conf. on Image Processing (ICIP). 2965–2968.
Yunlong Zhan, Yuzhang Gu, Kui Huang, Cheng Zhang, and Keli Hu. 2016. Accurate Image-Guided Stereo Matching With Efficient Matching Cost and Disparity Refinement. IEEE Trans. Circuits Syst. Video Technol. 26, 9 (Sept. 2016), 1632–1645.
Paolo Zicari, Stefania Perri, Pasquale Corsonello, and Giuseppe Cocorullo. 2012. Low-Cost FPGA Stereo Vision System for Real Time Disparity Maps Calculation. Microprocessors and Microsystems 36, 4 (2012), 281–288.