
Direct Stereo Visual Odometry Based on Lines

Thomas Holzmann, Friedrich Fraundorfer and Horst Bischof
Institute for Computer Graphics and Vision

Graz University of Technology, Austria
{holzmann, fraundorfer, bischof}@icg.tugraz.at

Keywords: Visual Odometry, Pose Estimation, Simultaneous Localisation And Mapping

Abstract: We propose a novel stereo visual odometry approach, which is especially suited for poorly textured environments. We introduce a novel, fast line segment detector and matcher, which detects vertical lines supported by an IMU. The patches around lines are then used to directly estimate the pose of consecutive cameras by minimizing the photometric error. Our algorithm outperforms state-of-the-art approaches in challenging environments. Our implementation runs in real-time and is therefore well suited for various robotics and augmented reality applications.

1 INTRODUCTION

In recent years, Simultaneous Localization and Mapping (SLAM) and Visual Odometry (VO) have become an increasingly popular field of research. In comparison to Structure from Motion (SfM), which is not constrained to run in real-time, SLAM and VO methods should be able to run in real-time even on lightweight mobile computing solutions.

Visual localization and mapping approaches are often used for small robotic devices (e.g., Micro Aerial Vehicles (MAVs)), where it is not possible to use heavier and more energy-consuming sensors like laser scanners. However, most of these devices are equipped with an Inertial Measurement Unit (IMU), which can be used in conjunction with vision methods to improve the results.

There exist widely used monocular SLAM/VO methods, which use one camera as the only information source and can therefore be used on many different mobile devices. In contrast to monocular approaches, stereo VO methods using a stereo setup with a fixed baseline have some benefits, especially when used for robotics: (1) Monocular approaches cannot determine the metric scale of a scene, while stereo approaches recover the correct scale as the baseline of the camera setup is fixed; additionally, monocular approaches suffer from scale drift. (2) Monocular approaches have problems with pure rotations, as no new 3D scene information can be computed from images without translational movement. In contrast, stereo approaches do not have these problems, as a stereo rig already contains translationally displaced cameras and can triangulate scene information regardless of the motion of the rig.

Based on the observation that man-made scenes usually contain strong line structures, e.g., doors or windows on white walls, we especially want to use these structures in our approach. Current feature-point based methods (e.g., (Geiger et al., 2011)) have problems in such scenes, as the lack of texture leads to a low number of detected feature points. Direct pose estimation methods (e.g., (Engel et al., 2014)) directly use the image information to align consecutive frames and do not use explicitly detected feature point correspondences to estimate the camera pose. Therefore, direct methods perform better in poorly textured environments, as they use more image information and do not rely on detected feature points in the scene.

In this paper, we present a pose estimation method specifically targeted at poorly textured indoor scenes where other methods deliver poor results. We estimate the pose of a camera by minimizing the photometric error of patches around lines. We introduce a fast line detector which detects vertical lines aided by an inertial measurement unit (IMU). If too few lines can be detected, we additionally detect point features and use patches around the detected points for direct pose estimation. In our experiments, we show that our method outperforms state-of-the-art monocular and stereo approaches in poorly textured indoor scenes and delivers comparable results on well textured outdoor scenes.


To summarize, our main contributions are:

• A direct stereo VO method using lines, running in real-time.

• A novel, efficient IMU-aided line detection algorithm that detects vertical lines.

• A fast line matching technique using a lightweight line descriptor.

2 RELATED WORK

A widely used approach to estimate the pose of a camera from image data alone is to use feature points. Feature points are detected and matched between subsequent images and used to estimate the pose of the camera. A popular monocular point-based SLAM approach is PTAM (Klein and Murray, 2007). It runs in real-time and is specifically designed to track a hand-held camera in a small augmented reality workspace. However, there also exist adaptations tailored to large-scale environments ((Mei et al., 2010), (Weiss et al., 2013)). As it needs to detect and match feature points, the environment has to contain sufficient texture suited for the feature point detection algorithm. Additionally, as it is a monocular approach, it has difficulties in handling pure rotations and cannot estimate the correct scale of the scene. A feature-based approach which uses a stereo camera as input, and therefore does not have these problems, is Libviso (Geiger et al., 2011). However, as it also uses explicitly detected feature points, the texture needs to be adequate for the detection algorithm. Additionally, as it is not a SLAM method like PTAM but just a VO method, it does not compute a global map of the environment but only computes the relative pose from frame to frame.

Instead of points, lines can also be detected and matched in order to compute the pose of a camera. In (Elqursh and Elgammal, 2011), Elqursh et al. propose to estimate the relative pose of two cameras by using three lines in a special primitive configuration. This has the advantage that no texture at all needs to be present; only a special line configuration has to exist. However, as explicit line detection is relatively slow, this approach cannot be used on a low-end onboard computer.

Recently, direct pose estimation algorithms became popular. In comparison to feature point-based approaches, direct approaches do not explicitly detect and match features and compute the pose from the feature matches, but compute the new pose by minimizing the photometric error over the whole image or over big parts of it. For example, DTAM (Newcombe et al., 2011) tracks the pose of a monocular camera given a dense model of the scene by minimizing the photometric error of the current image with respect to the whole model. As this is a computationally expensive task, a high-performance GPU is necessary to compute the pose in real-time. Similarly, LSD-SLAM (Engel et al., 2014) computes the pose by minimizing the photometric error. However, to reduce computational cost, it only uses areas in the image where the image gradient is sufficiently high. Therefore, it runs in real-time even on hand-held devices like smartphones. An extension of LSD-SLAM to a stereo setup has also been proposed recently (Engel et al., 2015). In (Forster et al., 2014), an approach is proposed which uses both explicit feature point matching and direct alignment and is therefore called semi-direct visual odometry (SVO). It detects feature points at keyframes and computes the poses of images between keyframes by minimizing the photometric error of patches around the feature points. It runs very fast even on onboard computers and is specifically designed for the downwards-looking camera of a micro aerial vehicle. In summary, direct methods perform direct image alignment to estimate the pose of a new frame with respect to previously computed 3D information by optimizing the pose parameters directly and by minimizing a photometric error. This is in contrast to feature-based methods, where features are detected and matched to compute the fundamental matrix, and optimization is only optionally used to refine the computed poses by using the reprojection error.

Our method is a direct method, but utilizes a stereo camera rig instead of a monocular camera. In contrast to all previous related work, we detect lines and estimate the pose by minimizing the photometric error of patches around the lines. Using our line detector, it is possible to detect lines very quickly in constrained images. As most of the structure in man-made environments consists of lines, using patches around lines is usually sufficient to directly align consecutive images.

3 DIRECT VISUAL ODOMETRY BASED ON VERTICAL LINES

In our approach, we explicitly use the structures contained in man-made environments to introduce a novel visual odometry approach.

Instead of directly aligning the whole image or explicitly detecting and matching feature points, we estimate the camera pose by aligning patches just around detected lines by minimizing the photometric


[Figure 1 pipeline diagram: stereo images and IMU measurements are input to a gravity alignment step, followed by direct pose estimation, which outputs the camera pose; at each keyframe, the 3D map is propagated to the current view, features (lines/points) are detected, and new 3D information is triangulated into the local map.]

Figure 1: Overview of the processing pipeline. Our algorithm directly estimates the pose of the camera by minimizing the photometric error of patches around vertical lines. For every keyframe, the 3D information from the previous keyframe gets projected into the current one, new lines and optionally points are detected, and new 3D information is triangulated within the stereo setup.

error. In case no lines can be detected, feature points are detected and patches around the points are used for image alignment.

3.1 System Overview

Our approach is a keyframe-based stereo approach which takes rectified stereo images and optionally synchronized IMU values as input. As preprocessing, we align the images according to the gravity direction using the IMU measurements. Without IMU measurements, our approach also works on images acquired with roughly gravity-aligned cameras.

For every keyframe, we detect and match lines using gravity-aligned stereo images. If not enough lines can be matched, we additionally detect and match point features. We triangulate this information to get a mapping of the visible information in 3D. A new keyframe is selected when the photometric properties of the images have changed too much or the movement w.r.t. the last keyframe is too large.

For every image, the pose is estimated by direct image alignment: we project all triangulated information of the previous keyframe into the current image and estimate the pose by minimizing the photometric error of patches around the lines and points.

Additionally, for every consecutive keyframe, we project lines and points from the previous one into the current one to reuse already computed 3D information if the same lines or points are detected.

In Fig. 1, one can see a schematic overview of our processing pipeline.

Figure 2: Image alignment according to the gravity direction. Left: an unaligned image from a forwards-looking camera. Right: the image aligned using the IMU. As one can see, when an image is aligned according to the gravity direction, straight vertical lines in the 3D scene are mapped to straight vertical lines in the image.

3.2 Fast Line Detection and Matching

Areas around lines contain high gradient values, which is an important property for direct image alignment. Therefore, we introduce a fast line detection and matching method, where the area around the lines is afterwards used for direct image alignment. We compute the depth of the lines by computing their disparity using the stereo image pair.

3.2.1 Line Detection

Usually, line detection (e.g., (Grompone et al., 2010)) is too slow for real-time applications (Hofer et al., 2014). To make the problem feasible, we focus on the detection of vertical lines, which can be done much faster. Man-made environments usually contain many vertical lines. Especially indoors, where the available texture very often mainly consists of lines, it is crucial for a visual odometry approach to also use this information. We introduce a detection algorithm aided by IMU information to make this challenging line detection step feasible: having cameras with arbitrary poses and synchronized IMU information as input, the images can be aligned with the gravity direction using the simultaneously acquired IMU data (see Fig. 2). As the angular accuracy of IMUs is quite high (see, e.g., the evaluation of the IMU of the Pixhawk autopilot (Cortinovis, 2010)) and most mobile computing devices already contain built-in IMUs, this preprocessing step can be done with many available mobile computing devices.

Even if the alignment is not perfect, this does not influence the pose estimation step, as the lines are not directly used for pose estimation, but rather the patches around them. Errors in the alignment just result in a poorer line detection performance, which can lead to shorter detected lines.

Assuming gravity-aligned images as input, the task of detecting vertical lines gets easier, as real-world vertical lines are now projected to vertical lines


Figure 3: Vertical lines detected with our line detection algorithm. One can see that the lines are not always detected over their whole length. This is due to slight gravity misalignment of the images or due to changing light conditions.

in the image. Using this assumption, our algorithm first computes the image gradient in horizontal direction by convolving the image with the Sobel operator. Next, for every column of the image, a histogram of the gradient values is created. When a histogram bin associated with sufficiently high gradient values contains enough elements, a line is detected in this column.

As the lines are detected with pixel accuracy and, therefore, the number of lines is limited by the number of columns, we obtain a small, bounded number of lines, which increases the speed of the following line matching step.

After the detection of lines, we apply a simple non-maxima suppression: lines are accepted if their corresponding gradient histogram bin contains more elements than every gradient histogram bin corresponding to nearby lines.

Finally, the start and end point of each line are detected. In the gradient image, we search along the associated column for the first pixel belonging to the same bin as the corresponding line bin. This defines the start point of the line. Next, we search for a pixel belonging to a different bin. This defines the end point of the line. In order to be robust against noise and short discontinuities of the line, pixels with a similar histogram bin and discontinuities of a few pixels are also assigned to the line. Fig. 3 shows a result of the line detection.
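To make the column-histogram procedure concrete, the following is a minimal sketch of the detection step using OpenCV. The function name detectVerticalLines and the thresholds minVotes and minGradient are our assumptions for illustration; only the histogram bin size of 20 is reported in the paper (Sec. 4.1).

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Sketch of the column-wise detection idea (Sec. 3.2.1). Parameter names and
// the exact voting scheme are illustrative assumptions, not the authors' code.
// binWidth corresponds to the gradient histogram bin size (20 in Sec. 4.1).
std::vector<int> detectVerticalLines(const cv::Mat& grayAligned,
                                     int binWidth = 20,
                                     int minVotes = 40,
                                     int minGradient = 60) {
  cv::Mat gradX;
  cv::Sobel(grayAligned, gradX, CV_16S, 1, 0);  // horizontal gradient only
  cv::Mat absGrad = cv::abs(gradX);

  std::vector<int> lineColumns;
  for (int x = 0; x < absGrad.cols; ++x) {
    // Histogram of gradient magnitudes for this column
    // (3x3 Sobel on 8-bit images yields magnitudes up to 4*255).
    std::vector<int> hist(256 * 4 / binWidth + 1, 0);
    for (int y = 0; y < absGrad.rows; ++y)
      hist[absGrad.at<short>(y, x) / binWidth]++;
    // A line is hypothesized if a bin of sufficiently high gradient
    // values contains enough pixels.
    for (size_t b = minGradient / binWidth; b < hist.size(); ++b) {
      if (hist[b] >= minVotes) { lineColumns.push_back(x); break; }
    }
  }
  // Non-maxima suppression over nearby columns would follow here (see text).
  return lineColumns;
}
```

Each column is filled in a single pass, so the detector is linear in the number of pixels, which is what makes it fast enough for real-time use.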

3.2.2 Line Matching

We iterate over all lines in the left image and try to find corresponding lines in the right image. Between a predefined minimum and maximum disparity, we store as potential matches all lines which were detected with a similar gradient histogram bin as the left line. Then, we compute the Sum of Absolute Differences (SAD) of a patch around the line in the left image and around the potential line matches in the right image.

For this, we need an equally-sized patch around the line in the left and right image. However, as the epipolar constraints do not hold anymore for the gravity-aligned images (the images have been rotated after rectification), it is not possible to just use the maximum and minimum y-positions of the lines, as it would be in the original, not gravity-aligned, rectified left and right images. Instead, we first have to project the line start and end points back to the unaligned images, select the maximum and minimum line start and end points, and project the points back to compute a correct, equally-sized patch for both lines.

Having computed the SAD for all potential matches, the line with the lowest SAD value in the right image is accepted as the correct match for the line in the left image if its SAD value is lower than a threshold.
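A sketch of this matching loop, under the assumption that each detected line carries its column, its gradient histogram bin, and an equally-sized intensity patch; the struct and function names are hypothetical, and the bin tolerance of 2 mirrors the value given in Sec. 4.1.

```cpp
#include <opencv2/core.hpp>
#include <cstdlib>
#include <limits>
#include <vector>

// Illustrative line observation: detected column, gradient histogram bin,
// and an intensity patch around the line (all patches equally sized).
struct LineObs { int column; int histBin; cv::Mat patch; };

// Returns the index of the best right-image match, or -1 if none is accepted.
int matchLine(const LineObs& left, const std::vector<LineObs>& rightLines,
              int minDisp, int maxDisp, int maxBinDiff, double sadThresh) {
  int best = -1;
  double bestSad = std::numeric_limits<double>::max();
  for (size_t i = 0; i < rightLines.size(); ++i) {
    int disp = left.column - rightLines[i].column;
    if (disp < minDisp || disp > maxDisp) continue;          // disparity range
    if (std::abs(left.histBin - rightLines[i].histBin) > maxBinDiff) continue;
    // Sum of absolute differences over the patches around both lines.
    double sad = cv::norm(left.patch, rightLines[i].patch, cv::NORM_L1);
    if (sad < bestSad) { bestSad = sad; best = static_cast<int>(i); }
  }
  // Accept only if the average per-pixel error is below the threshold.
  return (best >= 0 && bestSad / left.patch.total() < sadThresh) ? best : -1;
}
```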

We assume that the detected vertical lines are projections of vertical lines in the 3D scene into the image. Therefore, to compute the depth of a line, we just have to compute one disparity value for the whole line. However, as the epipolar constraints do not hold for the aligned images, we again project the lines into the unaligned images to compute the line disparity.
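The paper does not spell out the depth computation itself, but for a rectified stereo pair with focal length f and baseline b (approximately 13.5 cm and 18 cm in the experiments), the depth Z of a line with disparity d follows the standard relation:

```latex
Z = \frac{f \, b}{d}
```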

3.3 Supporting Points

When lines exist in the scene, they deliver good texture for direct image alignment. However, even in man-made environments consisting mainly of lines, there are areas where no or just a few lines exist. Therefore, we need a fall-back solution for the direct image alignment in images where few or no lines can be detected. We chose to use patches around feature points in this case.

3.3.1 Feature Point Detection

If no sufficient area of the image is covered by detected lines, we detect point features in areas of the image which are not covered by line features. To determine the coverage of features over the image, the image is partitioned into a 5x5 grid of equally sized cells. Every cell which contains a line or parts of a line is marked as covered; all other cells are marked as uncovered. If fewer than 20 of the 25 cells are marked as covered, point detection is applied. However, as the cells marked as covered already contain information from lines, we only extract feature points in the uncovered cells.


For every uncovered cell, we detect FAST corners (Rosten and Drummond, 2005) using an adaptive threshold. If the number of detected corners falls under 10, we decrease the corner threshold, while if the number of detected corners exceeds 20, we increase the threshold. We then store the determined threshold of each cell for the next image to be processed.
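A sketch of the per-cell adaptive detection, assuming OpenCV's FAST detector. The adaptation step size of 5 and the cell bookkeeping are our assumptions; only the 10/20 corner bounds come from the text.

```cpp
#include <opencv2/features2d.hpp>
#include <vector>

// Detect FAST corners in one uncovered grid cell (Sec. 3.3.1) and adapt the
// cell's threshold for the next frame. 'threshold' is stored per cell.
void detectCellCorners(const cv::Mat& gray, const cv::Rect& cell,
                       int& threshold, std::vector<cv::KeyPoint>& corners) {
  corners.clear();
  cv::FAST(gray(cell), corners, threshold, /*nonmaxSuppression=*/true);
  // Aim for 10..20 corners per cell; step size of 5 is an assumption.
  if (corners.size() < 10 && threshold > 5) threshold -= 5;
  if (corners.size() > 20)                  threshold += 5;
  // Shift keypoints from cell-local to image coordinates.
  for (auto& kp : corners) {
    kp.pt.x += cell.x;
    kp.pt.y += cell.y;
  }
}
```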

3.3.2 Feature Point Matching

Having corresponding feature points in two images and a calibrated stereo setup, one can triangulate a 3D point. Therefore, we need to find points in the right image corresponding to points in the left image.

To find point matches, we proceed as follows: for every detected feature point in the left image, we search for detected feature points in the right image lying on the corresponding epipolar line. For this, the points have to be projected back into the original unaligned images, as the epipolar constraint does not hold anymore for the aligned images.

We search for points between a minimum and maximum disparity value to ignore points that are too far away or too near. Having found potential matches, we compare the point from the left image with the potential matching points in the right image by computing the SAD for a patch around the point. The corresponding point with the lowest SAD is accepted as a match if the SAD value is under a fixed threshold.

3.4 Pose Estimation by Direct Image Alignment

Given 3D lines and points from the previous keyframe, our pose estimation algorithm estimates the pose by aligning the patches around the lines and points in the keyframe with the patches around the lines and points projected into the current frame (see Fig. 4). Alignment is done by non-linear least squares optimization on the intensity image.

3.4.1 Definitions and Notation

In this section, we state some definitions and notation conventions which are used in the following algorithm description.

In terms of notation, we denote matrices by bold capital letters (e.g., T) and vectors by bold lower-case letters (e.g., ξ).

A rigid body transformation T ∈ SE(3) is defined as

$$T = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix} \quad \text{with } R \in SO(3) \text{ and } t \in \mathbb{R}^3. \quad (1)$$

Figure 4: Direct image alignment using patches around lines. Left: keyframe, where lines are detected and patches around lines (red boxes) are extracted. Right: image to align. Red boxes are shifted according to the proposed new pose until a minimal photometric error is reached.

Given a transformation T_{0,i} which transforms the origin 0 into camera i and maps points from the coordinate frame of i to the origin 0, the same transformation can be assembled from a transformation T_{0,i-1} of a previous camera i-1 and a transformation T_{i-1,i} describing the transformation from camera i-1 to camera i:

$$T_{0,i} = T_{0,i-1} \cdot T_{i-1,i}. \quad (2)$$

For the optimization, we need a minimal description of transformations. Therefore, we use the Lie algebra se(3), which is the tangent space of the matrix group SE(3) at the identity, and whose elements are denoted as twist coordinates

$$\xi = \begin{pmatrix} \omega \\ \upsilon \end{pmatrix} \in \mathbb{R}^6, \quad (3)$$

where ω defines the rotation and υ defines the translation. As stated in (Ma et al., 2003), twist coordinates can be mapped onto SE(3) by using the exponential map:

$$T(\xi) = \exp(\hat{\xi}). \quad (4)$$
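For completeness, ξ̂ denotes the 4x4 matrix form of the twist. The hat operator and the closed form of the rotational part of the exponential (Rodrigues' formula) are standard results, see (Ma et al., 2003):

```latex
\hat{\xi} = \begin{pmatrix} \hat{\omega} & \upsilon \\ 0 & 0 \end{pmatrix},
\qquad
\hat{\omega} = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\
                               \omega_3 & 0 & -\omega_1 \\
                               -\omega_2 & \omega_1 & 0 \end{pmatrix},
\qquad
\exp(\hat{\omega}) = I
  + \frac{\sin\|\omega\|}{\|\omega\|}\,\hat{\omega}
  + \frac{1-\cos\|\omega\|}{\|\omega\|^2}\,\hat{\omega}^2.
```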


3.4.2 Non-Linear Least Squares Optimization on Image Intensities

To estimate the 6 DoF pose of a new image frame given the image patches around lines and points from the previous keyframe, we minimize the photometric error over the twist ξ by applying non-linear least squares optimization. We keep the 3D lines and points fixed in the optimization step, as we assume to already have accurately triangulated 3D information from the calibrated stereo setup.

The quadratic photometric error is defined as follows:

$$E(\xi) = \sum_{i=1}^{n} \left( I_{kf}(p_i) - I(\pi(p_i, d_{kf}(p_i), \xi)) \right)^2 = \sum_{i=1}^{n} r_i^2(\xi), \quad (5)$$

where I_{kf}(p_i) is the intensity value at position p_i in the keyframe to align to, π(p_i, d_{kf}(p_i), ξ) denotes the warping function which warps point p_i of the keyframe into the current image, and I is the current intensity image. This function computes the n residual values r_i(ξ), one for every pixel i.

In order to decrease the influence of outliers, i.e., patches with very high residual values, we use a Cauchy loss function to down-weight high residuals.

For performance reasons, we do not warp all pixels of the patches in image I_{kf} to image I, but just the central top and bottom points of the patches. For residual computation we then compare patches around the warped points in I with the patches from I_{kf}. This simplification is valid for small patches and small motion between frames and is similarly used by others (e.g., (Forster et al., 2014)).
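Since the implementation uses Ceres Solver (Sec. 4.1), one residual of Eq. 5 can be expressed as a cost functor roughly along the following lines. This is a compressed sketch, not the authors' code: sampleWarped() stands in for the warp π(·) plus bilinear intensity lookup, and the Cauchy loss scale of 0.5 is an arbitrary illustrative value.

```cpp
#include <ceres/ceres.h>

// Placeholder: warp a keyframe pixel with the twist xi (6-vector) into the
// current image and sample the intensity bilinearly; defined elsewhere.
double sampleWarped(const double* xi);

// One photometric residual (Eq. 5) as a Ceres functor, numeric differentiation.
struct PatchPhotometricError {
  explicit PatchPhotometricError(double refIntensity)
      : refIntensity_(refIntensity) {}

  bool operator()(const double* const xi, double* residual) const {
    residual[0] = refIntensity_ - sampleWarped(xi);
    return true;
  }
  double refIntensity_;
};

void addResidual(ceres::Problem& problem, double refIntensity, double* xi) {
  problem.AddResidualBlock(
      new ceres::NumericDiffCostFunction<PatchPhotometricError,
                                         ceres::CENTRAL, 1, 6>(
          new PatchPhotometricError(refIntensity)),
      new ceres::CauchyLoss(0.5),  // down-weights outlier patches (Sec. 3.4.2)
      xi);
}
```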

For all of the image patches, we seek a solution for ξ which minimizes the total squared photometric error:

$$\xi = \arg\min_{\xi} \frac{1}{2} E(\xi). \quad (6)$$

The minimum of E occurs where the gradient is zero:

$$\frac{\partial E}{\partial \xi_j} = 2 \sum_{i=1}^{n} r_i \frac{\partial r_i}{\partial \xi_j} = 0, \quad (j = 1, \ldots, 6). \quad (7)$$

As Eq. 6 is non-linear, we solve the problem iteratively. In every iteration, we linearize around the current state in order to determine a correction Δξ to the vector ξ:

$$E(\xi + \Delta\xi) \approx E(\xi) + J(\xi)\Delta\xi, \quad (8)$$

where J(ξ) is the Jacobian matrix of E(ξ). It has size n × 6, with n the number of residual values, and is given by:

$$J_{i,j}(\xi) = \frac{\partial E_i(\xi)}{\partial \xi_j}. \quad (9)$$

Using this linearization, the problem can be stated as a linear least squares problem:

$$\Delta\xi = \arg\min_{\Delta\xi} \frac{1}{2} \left\| J(\xi)\Delta\xi + E(\xi) \right\|^2. \quad (10)$$

In our approach, we use the Levenberg-Marquardt algorithm (Levenberg, 1944; Marquardt, 1963) to solve the linear least squares problem stated in Eq. 10.

By rearranging and plugging Eq. 8 and Eq. 9 into Eq. 7, we obtain the following normal equations:

$$J(\xi)^T J(\xi) \Delta\xi = -J(\xi)^T E(\xi). \quad (11)$$

We solve this equation system using a QR decomposition.
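Written out, one damped step of Eq. 11 solved by QR looks as follows. This is a sketch using Eigen under our own naming; the damping term λI is the Levenberg-Marquardt modification, which Ceres handles internally in the actual implementation.

```cpp
#include <Eigen/Dense>

// One damped Gauss-Newton (Levenberg-Marquardt) step for Eq. 11:
// solve (J^T J + lambda*I) dxi = -J^T r via QR decomposition.
// J is the n x 6 Jacobian, r the residual vector, lambda the LM damping.
Eigen::Matrix<double, 6, 1> lmStep(
    const Eigen::Matrix<double, Eigen::Dynamic, 6>& J,
    const Eigen::VectorXd& r, double lambda) {
  Eigen::Matrix<double, 6, 6> H =
      J.transpose() * J +
      lambda * Eigen::Matrix<double, 6, 6>::Identity();
  Eigen::Matrix<double, 6, 1> g = -J.transpose() * r;
  return H.colPivHouseholderQr().solve(g);  // QR, as in the paper
}
```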

Having computed the twist ξ, we can use Eq. 2 and Eq. 4 to compute an updated camera pose:

$$T_{0,i} = T_{0,kf} \cdot T(\xi), \quad (12)$$

where T_{0,i} is the current pose, T_{0,kf} is the pose of the last keyframe, and T(ξ) is the motion from the last keyframe to the current frame.

If the IMU measurements used for the gravity alignment of the images were exact all the time, the problem would reduce to 4 DoF, as roll and pitch would then be constantly 0. However, as the IMU measurements contain errors, roll and pitch can also vary. Therefore, we do not fix roll and pitch but just allow small rotations.

Even though patches around vertical lines do not contain a lot of horizontal structure, the horizontal structure contained in the patches is enough to align the images correctly also in the vertical direction.

3.5 Keyframe Selection and Map Extension

After a certain amount of motion, a new keyframe has to be selected, and subsequently new lines and possibly points have to be matched and triangulated.

We use a simple measure for keyframe selection which proved to work well in our experiments: after the pose estimation step, we compute the average photometric error per pixel. If this error exceeds a certain threshold, the current frame is selected as a new keyframe. Additionally, if the translation or rotation w.r.t. the last keyframe exceeds a threshold, the current frame is selected as a new keyframe.
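The resulting keyframe test reduces to a simple predicate. The translation/rotation bounds follow from Sec. 4.1 (0.9 times the optimization bounds of 1 m and 35 deg); the photometric threshold value here is our assumption, as the paper does not report it.

```cpp
// Sketch of the keyframe test in Sec. 3.5; kMaxAvgError is an assumed value.
bool needNewKeyframe(double avgPhotometricError,
                     double translationToKf, double rotationDegToKf) {
  const double kMaxAvgError = 15.0;           // avg. photometric error / pixel
  const double kMaxTranslation = 0.9 * 1.0;   // meters (Sec. 4.1)
  const double kMaxRotationDeg = 0.9 * 35.0;  // degrees (Sec. 4.1)
  return avgPhotometricError > kMaxAvgError ||
         translationToKf > kMaxTranslation ||
         rotationDegToKf > kMaxRotationDeg;
}
```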

For every keyframe, new lines and points are detected. However, before triangulating new lines or points, the already triangulated lines and points of the previous keyframe are projected into the current keyframe. If they are near a detected line/point in the current keyframe, the already computed depth value is assigned to the line/point in the current keyframe. All remaining detected lines are matched and triangulated as described in Sec. 3.2, and additional points are used if necessary as described in Sec. 3.3.

4 EXPERIMENTS AND RESULTS

In this section, we first describe the implementation used for the experiments in more detail and then evaluate our method on our own challenging indoor dataset acquired with a flying MAV, on the publicly available Rawseeds dataset (Bonarini et al., 2006; Ceriani et al., 2009), and on a subset of the KITTI dataset (Geiger et al., 2012). We show that our approach outperforms state-of-the-art methods on the first two challenging indoor datasets and delivers comparable results on the last dataset.

4.1 Implementation Details

We implemented our algorithm in C++, using OpenCV as image processing library and Ceres Solver (Agarwal et al.) for the optimization tasks.

In our evaluation, we used the images at full size, as downsampled images combined with the small baseline of the stereo setup significantly decreased the quality of the triangulated 3D information.

For line detection, we set the gradient histogram bin size to 20, which has been shown to deliver good detection results. In the non-maxima suppression step, we take into account lines with a distance not farther than 1.5% of the image width. For the matching, a line is matched with another one if their gradient histogram bins do not differ by more than 2. A match is then accepted if the patch around the line has an average photometric error per pixel smaller than 15.

In the optimization step, we initialize the pose estimate of a new frame with the pose of the last frame. In order to avoid random jumps and rotations when too few or erroneous measurements are used in the optimization, we set upper and lower bounds for the maximum rotational and translational movement w.r.t. the last keyframe: the maximum translational movement is set to 1 m and the maximum rotational movement is set to 35 deg. These thresholds, multiplied by 0.9, are also used for keyframe selection. In the patch alignment, bilinear image interpolation is used to determine the interpolated image intensity and gradient values. For lines, we use a patch width of 21 pixels and a height according to the line length. For points, we use a patch size of 21x21 pixels.
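The bilinear lookup used during patch alignment is the standard formulation; a sketch (not taken from the authors' code, and assuming the caller keeps x and y at least one pixel inside the image bounds):

```cpp
#include <opencv2/core.hpp>

// Standard bilinear intensity interpolation on an 8-bit grayscale image.
float sampleBilinear(const cv::Mat& img /* CV_8U */, float x, float y) {
  int x0 = static_cast<int>(x), y0 = static_cast<int>(y);
  float ax = x - x0, ay = y - y0;  // fractional parts
  return (1 - ax) * (1 - ay) * img.at<uchar>(y0,     x0) +
         ax       * (1 - ay) * img.at<uchar>(y0,     x0 + 1) +
         (1 - ax) * ay       * img.at<uchar>(y0 + 1, x0) +
         ax       * ay       * img.at<uchar>(y0 + 1, x0 + 1);
}
```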

4.2 Evaluation Setup

We evaluated our algorithm on a desktop PC with an Intel Core i7-4820K CPU (4 cores at 3.7 GHz, up to 8 threads) and 16 GB of RAM. However, our algorithm only used 4 threads for computation.

We compared our approach against Libviso (Geiger et al., 2011), which is a sparse feature-point based stereo visual odometry approach. We used it with subpixel refinement, set bucket height and width to 100 px, maximum features per bucket to 10, and match radius to 50, as these settings have shown to deliver good results on our MAV dataset and the Rawseeds dataset. For one test sequence, we also compared with LSD-SLAM (Engel et al., 2014), which is a direct monocular SLAM approach. However, we did not use LSD-SLAM in the other experiments, as those sequences are not suited for a monocular approach, which cannot handle pure rotations and has problems with pure forward motion. We disabled loop closure and global map optimization in LSD-SLAM by setting the doSLAM option to false, as these trajectory improvement techniques are also not used in our approach.

4.3 Own Test Sequences

To evaluate the performance of our algorithm in challenging indoor environments where only little texture is available, we captured an image sequence by flying with an MAV in a room containing few texture elements. As capturing system, we used an Asctec Pelican equipped with two forward and slightly downwards looking Matrix Vision BlueFox-MLC202b cameras with a baseline of approximately 13.5 cm (see Fig. 5). In this setup, each camera has a horizontal field of view of 81.2 deg and acquires images at 20 frames per second (FPS) with a resolution of 640x480 pixels. In order to capture synchronized images and IMU measurements, we externally trigger the cameras with the Asctec Autopilot. Simultaneously with the trigger signal, the Asctec Autopilot captures a timestamped IMU measurement. Both the images and the IMU measurements are then stored on the Odroid XU3 Lite computation board onboard the MAV. We calibrated both the stereo camera setup and the rotation from IMU to camera with the Kalibr toolbox (Furgale et al., 2013).

The sequence is captured while flying with the MAV in front of a wall in a broad hallway. The trajectory is similar to a rectangle, while both cameras are looking in the direction of the wall all the time. The start and end points are nearly identical, therefore the


Figure 5: The Asctec Pelican used to acquire our own test dataset. It is equipped with an IMU (included in the autopilot), a stereo camera, and an Odroid XU3 Lite processing board.

Figure 6: Images of the sequence acquired with the flying MAV. As can be seen, the scene mostly consists of white walls with little texture.

absolute trajectory error can be observed. This scenario is similar to an augmented reality application where, for example, an MAV equipped with a laser projector should project something onto the wall.

As can be seen in the input images in Fig. 6, the scene mostly consists of white, untextured walls. Only a few structural elements are visible (doors, a recycle bin).

On this sequence, we compare against Libviso (Geiger et al., 2011) and LSD-SLAM (Engel et al., 2014). As LSD-SLAM is a monocular approach and therefore does not provide a metric scale, we manually aligned the trajectory scale with the stereo approaches. Generally, however, this trajectory should be well suited for a monocular approach, as there are no pure rotations and no forward movement in the initialization phase.

The computed trajectories are plotted in Fig. 7. In comparison to Libviso (blue), our approach (red) does not have a big drift. The start/end point difference of our approach is approximately 2.5 m, while Libviso has a difference of approximately 7 m. Libviso already accumulates a big translational drift while the MAV is standing still on the floor at the beginning of the trajectory. While flying, Libviso accumulates a big rotational drift, which leads to big changes in the z-dimension. In contrast, LSD-SLAM estimates parts of the trajectory quite well. However, some random pose jumps happen in the estimation, which yields a trajectory that is partly far away from the correct solution. In total, our approach performs best, as it has neither a big drift nor big outliers.

4.4 Rawseeds

The Rawseeds datasets (Bonarini et al., 2006; Ceriani et al., 2009) are publicly available indoor and outdoor datasets captured with a ground robot equipped with multiple sensors. Additionally, ground truth data was captured for some parts of the trajectories with an external tracking system.

In our experiments, we chose the Bicocca 2009-02-27a dataset, which is an indoor dataset captured in a static environment with mixed natural and artificial lighting. Note that this is a really large-scale dataset, as the total trajectory length is 967.18 m. We used the left and right images of the trinocular forward-looking camera for our experiments, which have a resolution of 640x480 pixels each and a baseline of 18 cm, and were captured at 15 FPS. As the images were acquired roughly aligned with the gravity direction, no IMU measurements are needed for our algorithm. Example input images can be seen in Fig. 8.

As the ground truth computed with the external tracking system does not cover the whole trajectory, we use the publicly available extended ground truth in our experiments. This ground truth also includes trajectory parts estimated from the data of an onboard laser scanner.

In the experiments on this dataset, we compare our approach with Libviso. Tab. 1 shows the relative pose errors as proposed in (Sturm et al., 2012). Our approach gives better results for all computed relative pose errors. However, due to the difficulty of the sequence, which also contains images with no texture at all (see Fig. 8), the error is relatively high for our algorithm as well. Additionally, the framerate should be higher for our approach: as this dataset was captured at only 15 FPS, the scene sometimes moves too fast in front of the camera and the camera pose cannot be tracked exactly. However, in the parts of the sequence where some vertical line structure exists (e.g., doors, windows, or shelves), our approach works better and the errors are therefore lower.


Figure 7: Estimated trajectories of Libviso (left, blue), our approach (middle, red), and LSD-SLAM (right, green) on the test sequence captured with an MAV. Top row: x- and y-axes (top view). Bottom row: x- and z-axes (side view). The start point is at (0,0,0) and the trajectory end point, which should be near the start point, is indicated with a circle. Even though our approach (red) has a slight drift (the start point is not identical with the end point), Libviso (blue) has a much higher rotational drift. LSD-SLAM works well for some parts of the sequence; however, it occasionally performs random jumps and moves far away from the desired trajectory.

Figure 8: Images of the Rawseeds Bicocca 2009-02-27a sequence. Top left: a big part of the images look similar to this one, where the robot moves through a narrow corridor. Top right: at some corridor intersections, where nearly pure rotations are performed, just the white wall is captured by the cameras, which is extremely difficult for a VO algorithm. Bottom row: parts with better input data for VO also exist, e.g., wider hallways and a library.

4.5 KITTI

The KITTI dataset (Geiger et al., 2012) is currently a popular dataset for visual odometry evaluation. It is

Table 1: Relative translational and rotational errors on the Rawseeds sequence, as defined in (Sturm et al., 2012), per second of movement, for our approach and Libviso. In all metrics, our approach performs better.

                        Our Approach   Libviso
Transl. RMSE (m/s)      0.665          0.753
Transl. Median (m/s)    0.577          0.68
Rot. RMSE (deg/s)       10.892         12.229
Rot. Median (deg/s)     0.018          0.033

acquired with a stereo camera mounted on a car which drives on public streets, mostly through cities. We compare our algorithm against Libviso (Geiger et al., 2011) on a subset of the dataset (sequence 07).

The images have a resolution of 1241x376 pixels and a framerate of 10 FPS. As the images are roughly aligned with the gravity direction, there is no need to incorporate IMU measurements. Due to the low capture framerate and the relatively fast movement of the car, the image content can change quite rapidly. This is a problem for direct VO methods, which deliver good results for small inter-frame motions. Additionally, the dataset contains enough texture for feature-based methods to work properly. However, we want to show that our approach also works well on datasets


Figure 9: Images of the KITTI visual odometry dataset (sequence 07). In this sequence, lines are detectable in some parts (e.g., in the top image). In these parts, the detected lines are beneficial for the VO algorithm. However, in some images no lines exist (e.g., the bottom image) and our algorithm then also depends on the detected feature points.

Table 2: Relative translational and rotational errors on KITTI sequence 07, computed with the KITTI vision benchmark suite (Geiger et al., 2012) as the mean over all possible subsequences of length (100, ..., 800) meters.

                     Our Approach   Libviso
Transl. error (%)    8.0558         2.3676
Rot. error (deg/m)   0.000744       0.000321

for which it is not specifically designed.

To overcome the low framerate, we had to change the patch size of the patches around lines and points used in the optimization from 21 to 31. Additionally, we had to increase the maximum translational movement for the optimization step. Using these settings, we get results comparable to state-of-the-art methods, with the drawback of higher runtimes compared to the other datasets used in our evaluation. Additionally, as more reliable points can be detected in this scene, we decreased the line matching threshold so that more points are used.

We also changed the parameter settings for Libviso: we used it with default settings and subpixel refinement.

In Fig. 10, the computed trajectories of Libviso and our approach are plotted. As one can see, the absolute trajectory error is slightly worse for our approach. The rotational and translational errors as defined in the KITTI vision benchmark suite are also worse than Libviso's (see Tab. 2). However, our algorithm still works acceptably on a dataset like this one, for which it was not designed.

4.6 Runtime

The mean computation time per frame for the first two datasets is 49.3 ms, which corresponds to a framerate of 20.3

Figure 10: Evolving trajectory of the KITTI sequence (ground truth, Libviso, and our approach). One can observe that our approach (red) performs comparably to Libviso (blue).

FPS. With this computation speed, both the Rawseeds dataset and our own MAV dataset can be processed in real-time. Due to the changed parameters, the computation for the KITTI dataset is much slower and needs approximately 3 times the processing time of the other datasets.

Most of the processing time is needed for the pose estimation step, which could be optimized by a multi-scale approach, by using SIMD instructions (SSE, NEON), or by writing our own optimizer to overcome the overhead of Ceres.

5 CONCLUSION

We have presented a direct stereo visual odometry method which detects and matches lines at every keyframe using our novel fast line detection algorithm and estimates the pose of consecutive frames by direct pose estimation using patches around vertical lines. These patches have proven to be a good selection for indoor environments which are poorly textured but contain a lot of vertical structure. In our experiments, we have shown that our algorithm delivers better results than state-of-the-art methods in challenging indoor environments and comparable results in well textured outdoor environments. As our implementation runs in real-time, it is suitable for various robotics and augmented reality applications. Future work will include subpixel refinement of the line detection step, global map optimization to minimize drift errors and evolve the method into a complete SLAM system, a multi-scale approach for the pose estimation step to deliver better accuracy for large inter-frame movements and to improve computation speed,


and further speed improvements (e.g., using SSE and NEON instructions) to make it real-time capable also on small robotic platforms.

ACKNOWLEDGEMENTS

This project has been supported by the Austrian Science Fund (FWF) in the project V-MAV (I-1537).

REFERENCES

Agarwal, S., Mierle, K., and Others. Ceres Solver. http://ceres-solver.org.

Bonarini, A., Burgard, W., Fontana, G., Matteucci, M., Sorrenti, D. G., and Tardos, J. D. (2006). Rawseeds: Robotics advancement through web-publishing of sensorial and elaborated extensive data sets. In International Conference on Intelligent Robots and Systems.

Ceriani, S., Fontana, G., Giusti, A., Marzorati, D., Matteucci, M., Migliore, D., Rizzi, D., Sorrenti, D. G., and Taddei, P. (2009). Rawseeds ground truth collection systems for indoor self-localization and mapping. Autonomous Robots, 27(4):353-371.

Cortinovis, A. (2010). Pixhawk - attitude and position estimation from vision and IMU measurements for quadrotor control. Technical report, Computer Vision and Geometry Lab, Swiss Federal Institute of Technology (ETH) Zurich.

Elqursh, A. and Elgammal, A. M. (2011). Line-based relative pose estimation. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 3049-3056. IEEE Computer Society.

Engel, J., Schops, T., and Cremers, D. (2014). LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings European Conference on Computer Vision.

Engel, J., Stuckler, J., and Cremers, D. (2015). Large-scale direct SLAM with stereo cameras. In International Conference on Intelligent Robots and Systems.

Forster, C., Pizzoli, M., and Scaramuzza, D. (2014). SVO: Fast semi-direct monocular visual odometry. In International Conference on Robotics and Automation.

Furgale, P., Rehder, J., and Siegwart, R. (2013). Unified temporal and spatial calibration for multi-sensor systems. In International Conference on Intelligent Robots and Systems.

Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition.

Geiger, A., Ziegler, J., and Stiller, C. (2011). StereoScan: Dense 3d reconstruction in real-time. In IEEE Intelligent Vehicles Symposium.

Grompone, R., Jakubowicz, J., Morel, J. M., and Randall, G. (2010). LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):722-732.

Hofer, M., Maurer, M., and Bischof, H. (2014). Improving sparse 3d models for man-made environments using line-based 3d reconstruction. In International Conference on 3D Vision.

Klein, G. and Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In Proceedings International Symposium on Mixed and Augmented Reality.

Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2:164-168.

Ma, Y., Soatto, S., Kosecka, J., and Sastry, S. S. (2003). An Invitation to 3-D Vision: From Images to Geometric Models. Springer Verlag.

Marquardt, D. (1963). An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal on Applied Mathematics, 11(2):431-441.

Mei, C., Sibley, G., Cummins, M., Newman, P., and Reid, I. (2010). RSLAM: A system for large-scale mapping in constant-time using stereo. International Journal of Computer Vision.

Newcombe, R. A., Lovegrove, S. J., and Davison, A. J. (2011). DTAM: Dense tracking and mapping in real-time. In Proceedings International Conference on Computer Vision, pages 2320-2327.

Rosten, E. and Drummond, T. (2005). Fusing points and lines for high performance tracking. In Proceedings International Conference on Computer Vision.

Sturm, J., Engelhard, N., Endres, F., Burgard, W., and Cremers, D. (2012). A benchmark for the evaluation of RGB-D SLAM systems. In International Conference on Intelligent Robots and Systems, pages 573-580.

Weiss, S., Achtelik, M. W., Lynen, S., Achtelik, M. C., Kneip, L., Chli, M., and Siegwart, R. (2013). Monocular vision for long-term micro aerial vehicle state estimation: A compendium. Journal of Field Robotics, 30(5):803-831.