
Depth Completion using a View Constrained Deep Prior

Pallabi Ghosh1, Vibhav Vineet2, Larry S. Davis1, Abhinav Shrivastava1, Sudipta Sinha2, and Neel Joshi2

1 University of Maryland, College Park, MD, USA
2 Microsoft Research, Redmond, WA, USA

Abstract. Recent work has shown that the structure of convolutional neural networks (CNNs) induces a strong prior that favors natural images. This prior, known as a deep image prior (DIP), is an effective regularizer in inverse problems such as image denoising and inpainting. We extend the DIP concept to depth images. Given color images and noisy and incomplete target depth maps, we optimize a randomly-initialized CNN model to reconstruct a restored depth map, using the CNN network structure as a prior combined with a view-constrained photo-consistency loss, which is computed using images from a geometrically calibrated camera at nearby viewpoints. We apply this deep depth prior to inpaint and refine incomplete and noisy depth maps within both binocular and multi-view stereo pipelines. Our quantitative and qualitative evaluation shows that our refined depth maps are more accurate and complete, and after fusion, produce dense 3D models of higher quality.

1 Introduction

There are numerous approaches for measuring depth in a scene, such as using binocular [35] or multi-view [9,16,37] stereo or directly measuring depth with depth cameras, LIDAR, etc. Each one of these approaches suffers from artifacts, such as noise, inaccuracy, and incompleteness, due to the respective weaknesses of each approach. As depth measurement is not a well-posed problem, extensive research has been conducted to solve the problem using approximate inference and optimization techniques that employ appropriate priors and regularization [51,5,43,44].

Supervised learning methods based on convolutional neural networks (CNNs) have recently shown promise in improving depth measurements, both in the binocular [55,29,22] and multi-view [21,53,20] stereo settings. However, these supervised methods rely on vast amounts of ground truth data to achieve proper generalization. While unsupervised learning approaches have been explored [56,45,32], their success appears modest compared to supervised methods.

In this work, we propose a new approach for improving depth measurements that is inspired by the recent work of Ulyanov et al. [47], who demonstrated that the underlying structure of an encoder-decoder CNN induces a prior that favors natural images, a property they refer to as a "deep image prior" (DIP).


Fig. 1. (a) One of the input images (b) Target depth map computed using SGM and multiple neighboring images [19] (c) The refined depth map generated using our deep depth prior (DDP) based refinement. Notice that the refined depth map fills the holes in (b) (shown in white).

Ulyanov et al. show that the parameters of a randomly initialized encoder-decoder CNN can be optimized to map a high-dimensional noise vector to a single image. When the image is corrupted and the optimization is stopped at an appropriate point before overfitting sets in, the network outputs a remarkably noise-free image. The DIP has since been used as a regularizer in a number of low-level vision tasks such as image denoising and inpainting [47,50,12].

In this work, we propose using DIP-based regularization for refining and inpainting noisy and incomplete depth maps. Our approach is a post-processing refinement step that can work on depth maps from any source.

Using a network similar to Ulyanov et al., our approach generates a depth map by combining a depth reconstruction loss with a view-constrained photo-consistency loss. The latter loss term is computed by warping a color image into neighboring views using the generated depth map and then measuring the photometric discrepancy between the warped image and the original image.

In this sense, our technique resembles direct methods proposed decades ago for image registration problems, which all employ some form of initialization and iterative optimization. However, instead of using handcrafted regularizers in the optimization objective, we use the depth image prior as the regularizer. While the role of regularization in end-to-end trainable CNN architectures is gaining interest [26,52], our method is quite different, because there is no training and the network parameters are optimized from scratch on new test images. Fig. 1 shows the inpainting results of applying our technique (DDP) on an input depth map with holes.

To the best of our knowledge, this is the first work to investigate deep image priors for improving depth images. We evaluate our approach using results from modern stereo pipelines and depth cameras and show that the refined depth maps are more accurate and complete, leading to more complete 3D models.

2 Related Work

Stereo matching. Dense stereo matching is an extensively studied topic and there has been tremendous algorithmic progress both in the binocular setting [19,55,29,56,26] as well as in the multi-view setting [37,17,53,20,10], in conjunction with advances in benchmarking [34,14,38,23,1]. Traditionally, the best performing stereo methods were based on approximate MRF inference on pixel grids [51,5,43], where including suitable smoothness priors was considered quite crucial. However, such methods were usually computationally expensive. Hirschmuller [19] proposed


Semi-Global Matching (SGM), a method that provides a trade-off between accuracy and efficiency by approximating a 2D MRF optimization problem with several 1D optimization problems. SGM has many recent extensions [13,2,41,36,39] and also works for multiple images [17]. Region growing methods are another alternative that has also shown promise and implicitly incorporates smoothness priors [4,37,27,18].

Deep Stereo. In recent years, deep models for stereo have been proposed to compute better matching costs [55,28,7] or to directly regress disparity or depth [29,22,56,6], and also for the multi-view setting [21,53,20]. Earlier on, end-to-end trainable CNN models did not employ any form of explicit regularization, but recently hybrid CNN-CRF methods have advocated using appropriate regularization based on conditional random fields (CRFs) [26,52]. In contrast with these works, as we do not perform learning by fitting to training data, our approach is more generalizable, as it does not fall prey to the tendency of deep approaches to overfit to their training data.

Depth Map Refinement. The fast bilateral solver [3] is an optimization technique for refining disparity or depth maps. However, its objective is fully hand-crafted. Knobelreiter and Pock recently proposed a refinement scheme where the regularizer in the optimization objective is trained using ground truth disparity maps [25]. Their model learns to jointly reason about image color, stereo matching confidence and disparity. Voynov et al. [48] use a deep prior for depth super-resolution, but they do not have a multi-view constraint, as we do, nor do they investigate refinement and hole-filling. Other recent disparity or depth map refinement techniques utilize trained CNN models [31].

Deep prior for color images. Beyond the previously discussed work of Ulyanov et al. [47], deep image priors have been extended for a number of diverse applications: neural inverse rendering [42], mesh reconstruction from 3D points [50], and layer-based image decomposition [12]. Recently, Cheng et al. [8] pointed out important connections between DIP and Gaussian processes. Our approach is in a similar vein as these approaches, where we modify the DIP for depth maps by combining the usual reconstruction loss with a second term, the photo-consistency loss, which ensures that when the reference image is warped into a neighboring view using our depth map, the discrepancy between the warped image and the original image is minimized.

3 Method

Given an RGBD image with I_in as the RGB component and D_in as the noisy depth component, our goal is to generate a denoised and inpainted depth image D*. We leverage the recently proposed Deep Image Prior (DIP) [47] to solve this problem. We first briefly describe the DIP approach.

3.1 Deep Image Prior

The DIP method proposed a deep-network-based technique for solving low-level vision problems such as image denoising, restoration, and inpainting. At

Fig. 2. Overview of Deep Depth Prior (DDP): The DDP network is trained using a combination of L1 and SSIM reconstruction loss with respect to a target RGB-D image and a photo-consistency loss with respect to neighboring calibrated images. This network is used to refine a set of noisy depth maps, and the refined depth maps are subsequently fused to obtain the final 3D point cloud model. I_out and D_out are the RGB and depth output of our network. I_nbr is the RGB at a neighboring viewpoint. I_ref and D_ref are the input RGB and depth at the current (reference) viewpoint.

the core of their method lies the idea that deep networks can serve as a prior for such inverse problems. If x is the input image, n is the input noise and x_o is the denoised output of the network f_θ, then the optimization problem of the DIP method takes the following form:

θ* = argmin_θ L(f_θ(n); x),    x*_o = f_{θ*}(n).    (1)

The task of finding the optimal neural network parameters θ* and the optimal denoised image x*_o is solved using standard backpropagation.
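To make the optimization above concrete, here is a minimal PyTorch-style sketch of the DIP fitting loop; the `build_encoder_decoder` constructor, the plain L1 reconstruction loss, and the iteration budget are illustrative assumptions rather than the exact configuration used by Ulyanov et al. or in this paper.

```python
import torch
import torch.nn.functional as F

def fit_deep_image_prior(target, build_encoder_decoder, num_iters=3000, lr=1e-3):
    """Minimal DIP sketch: fit a randomly initialized encoder-decoder to a single
    corrupted image `target` (a 1xCxHxW tensor). Early stopping (num_iters) is what
    keeps the output noise-free, so it must be tuned per task."""
    net = build_encoder_decoder()                                 # randomly initialized f_theta
    noise = torch.randn(1, 16, target.shape[2], target.shape[3])  # fixed input noise n
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(num_iters):
        optimizer.zero_grad()
        out = net(noise)                                          # f_theta(n)
        loss = F.l1_loss(out, target)                             # stand-in for L(f_theta(n); x)
        loss.backward()
        optimizer.step()
    return net(noise).detach()                                    # x*_o = f_theta*(n)
```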

Fig. 3. (a) Input depth map with holes (b) DDP on depth maps only and (c) DDP on RGBD images. In the black box regions in (b), DDP fills up the holes in the sky or background based on the depth of the house or radio because it has no edge information. The RGBD input provides this edge information in (c).

A simple approach to address depth denoising and inpainting would be to use a DIP-like encoder-decoder architecture to improve the depth images. Here, depth images would replace RGB as inputs in the original DIP framework. However, this fails to fill the holes with correct depth values. Some of the visual outputs are shown in Fig. 3. More quantitative results are provided in Table 2.

We hypothesize three reasons for this failure. First, holes near object boundaries can cause incorrect depth filling. Second, depth images have more diverse values than RGB images, which leads to large quantization errors; correctly predicting values over such a large quantized space is a challenging task for the DIP network. Finally, the absolute error for far objects may dominate the DIP optimization over important nearby objects.

3.2 Deep Depth Prior

In this work, we propose the Deep Depth Prior (DDP) approach, which includes three new losses to solve the issues discussed above. Our approach is built on top of the inpainting task in [47], where we create a mask for the holes in the depth map and calculate the loss over the non-masked regions. Fig. 2 gives an overview of our system.

Preliminary. To solve the issue of the absolute error for far objects dominating the DIP optimizer, inverted depth or disparity images are used. We also add a constant value to the depth image that reduces the ratio between the maximum and minimum depth values. Further, all far away objects beyond a certain depth are masked out by clipping to a predefined maximum depth value. We also clip depth to a minimum value so that the maximum disparity value does not go to infinity. We provide these values in the supplementary material.
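As a rough illustration of this preprocessing, the sketch below offsets, clips, and inverts a depth map; the offset and clipping values are hypothetical placeholders, since the exact numbers are deferred to the supplementary material.

```python
import numpy as np

def depth_to_disparity(depth, offset=0.5, d_min=0.1, d_max=50.0):
    """Preprocess a depth map for DDP (values here are illustrative only):
    - add a constant offset to shrink the max/min depth ratio,
    - clip far depths so distant points do not dominate the loss,
    - clip near depths so disparity stays finite,
    - invert to disparity. Holes (depth == 0) stay 0 and are handled by the loss mask."""
    valid = depth > 0
    clipped = np.clip(depth + offset, d_min, d_max)
    disparity = np.where(valid, 1.0 / clipped, 0.0)
    return disparity, valid   # `valid` becomes the non-hole mask used in the losses
```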

Let D_out be the desired depth output from our network and let f_θ denote this generator network. The input to the network is noise n_in, and the input depth map is inverted to get Z_in as the noisy disparity map. Let us represent the output from the network as Z_out, where Z_out = f_θ(n_in; Z_in). On convergence, the optimal D* is obtained by inverting Z*.

We use three different losses to optimize our network. The total loss is:

L_total = γ1 L_disp + γ2 L_RGB + (1 − γ1 − γ2) L_warp.    (2)

Disparity-based loss L_disp. The simplest technique to obtain Z* is to optimize only on disparity. The disparity-based loss L_disp is a weighted combination of the Mean Absolute Error (MAE) or L1 loss and the Structural Similarity metric (SSIM) or L_ss loss [49], and takes the following form:

L_disp = λ_z L1(Z_in, Z_out) + (1 − λ_z) L_ss(Z_in, Z_out).    (3)

We use the L1 loss instead of the Mean Squared Error (L2) loss so that very high-valued noise does not have an outsized effect on the optimization. It takes the form L1(Z_in, Z_out) = |Z_in − Z_out|.

The structural similarity loss L_ss measures similarity between the input Z_in and the reconstructed disparity map Z_out. Here, similarity is defined at the block level, where each block is of size 11x11; it provides consistency at the region level. The loss takes the form L_ss = 1 − SSIM(Z_in, Z_out). Details about the SSIM function are provided in the supplementary material.
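A minimal sketch of this weighted L1 + SSIM combination (used analogously for the RGB loss below) is given here; it assumes a caller-supplied `ssim_fn` that returns mean SSIM over 11x11 windows, and the weight value is a placeholder rather than the tuned hyper-parameter.

```python
import torch

def reconstruction_loss(x_in, x_out, mask, ssim_fn, lam=0.85):
    """Weighted L1 + SSIM loss over non-hole pixels, in the spirit of Eq. (3)/(4).
    `ssim_fn(a, b)` is assumed to return the mean SSIM; `lam` is an illustrative weight."""
    mask = mask.float()
    l1 = (torch.abs(x_in - x_out) * mask).sum() / mask.sum().clamp(min=1.0)  # masked |x_in - x_out|
    l_ss = 1.0 - ssim_fn(x_out * mask, x_in * mask)                          # L_ss = 1 - SSIM
    return lam * l1 + (1.0 - lam) * l_ss
```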

RGB-based loss L_RGB. We observed that under certain situations the disparity-based loss blurs the edges in the final generated depth map. This happens when there is a hole in the depth map near an object boundary. The generator network produces a depth map that is a fused version of the depths of different objects appearing around the hole.


For example, consider the regions belonging to the sky in the top row of Fig. 3. Due to the homogeneous nature of the sky pixels, standard disparity/depth estimation methods fail to produce any disparity/depth for such regions. However, pixels corresponding to the house region have depth values. When the image is passed to a DIP generator, the edge between the house and the sky gets blurred, because the network is trying to fill up the space without any additional knowledge, e.g., boundary information. It bases its decision only on the depth of neighboring regions to fill up the incomplete areas, as seen in the top row of Fig. 3(b) in the black box region.

To solve this problem, we also pass the color RGB image along with the disparity image. The encoder-decoder based DDP architecture is now trained on the 4-channel RGBD information. Weights of the network are updated not only on the masked disparity map but also on the full RGB image. This helps the network to leverage the object boundary information in the RGB image to fill the holes in the disparity (and hence depth) image. This important edge information helps the network to generate crisp depth images, as seen in Fig. 3(c).

Let I_out be the output corresponding to the input data I_in using the noise n_in and generator model f_θ: I_out = f_θ(n_in; I_in).

The loss L_RGB is also a weighted combination of L1 and SSIM losses, and is defined as:

L_RGB = λ_I L1(I_in, I_out) + (1 − λ_I) L_ss(I_in, I_out).    (4)

The RGB-based loss helps to resolve the issue of blurring observed around edges near object boundaries. However, putting equal weights on the disparity L_disp and RGB L_RGB components of the total loss leads to artifacts appearing in the disparity image, and hence in the output depth images as well. In particular, these artifacts are due to textures and edges from an object's appearance that are unrelated to the depth boundaries.

For example, in Fig. 4 the wall of the house is a single surface and should have a smooth depth map. However, the DDP network trained on RGBD data generates vertical textures in the depth images that appear due to the vertical wooden planks in the RGB image.

Warping-based loss. We propose to include a warping loss to improve our outputs. Before defining the warping loss, let us first define the warping function T_nbr^ref. Given the camera poses C_ref and C_nbr of the reference and neighboring views, the function T_nbr^ref warps the neighboring view into the reference view.

We are trying to generate the denoised output Z_out for the reference view. We first find the top N neighboring views of the reference view. These neighboring views are selected using the method used in MVSNet [53]. Let nbr denote one of these N views and let W_nbr^ref be the image warped from the neighboring view to the reference view. Let D_ref^out be the predicted depth (inverted Z_ref^out) for the reference view and I_nbr^in be the input RGB for the neighboring view; then the warped image is W_nbr^ref = T_nbr^ref(D_ref^out, I_nbr^in; C_nbr, C_ref). Further, we use bilinear interpolation for warping, which helps to remove holes.
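The sketch below shows one common way to realize such a warp with bilinear sampling in PyTorch: back-project reference pixels with the predicted depth, transform them into the neighboring camera, project, and sample the neighbor's colors with grid_sample. The shared-intrinsics assumption and tensor shapes are our own simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_neighbor_to_ref(img_nbr, depth_ref, K, R, t):
    """Bilinear warp of a neighboring RGB image into the reference view using the
    predicted reference-view depth (a sketch of T_nbr^ref under assumed conventions).
    img_nbr: 1x3xHxW neighbor image; depth_ref: 1x1xHxW reference depth;
    K: shared 3x3 intrinsics; R, t: pose mapping reference-frame points to the neighbor frame."""
    _, _, H, W = depth_ref.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)  # homogeneous pixels
    rays = torch.linalg.inv(K) @ pix                       # back-projected rays
    pts_ref = rays * depth_ref.reshape(1, -1)              # 3D points in the reference frame
    pts_nbr = R @ pts_ref + t.reshape(3, 1)                # points in the neighbor frame
    proj = K @ pts_nbr
    u = proj[0] / proj[2].clamp(min=1e-6)
    v = proj[1] / proj[2].clamp(min=1e-6)
    grid_x = 2.0 * u / (W - 1) - 1.0                       # normalize to [-1, 1] for grid_sample
    grid_y = 2.0 * v / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(1, H, W, 2)
    return F.grid_sample(img_nbr, grid, mode="bilinear", align_corners=True)
```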


Fig. 4. (a) RGB image (b) Input disparity map (c) Disparity output from DDP trained with equal weight for RGB and depth loss. The RGB artifacts are evident in (c) through the vertical and horizontal lines representing the wooden planks in the wall. (d) Disparity output from DDP trained with lower weight for RGB compared to depth loss. The artifacts disappear in (d).

Now, given I_ref^in, the input RGB for the reference view, we can compute the warping loss as:

L_{nbr-ref}^warp = λ_w L_ss(I_ref^in, W_nbr^ref) + (1 − λ_w) L1(I_ref^in, W_nbr^ref).    (5)

When there are multiple neighboring views, the loss is averaged over them as L_ref^warp = (1/N) Σ_{nbr=1..N} L_{nbr-ref}^warp.
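Putting the pieces together, a hedged sketch of the total objective in Eq. (2) could look as follows; it reuses the `reconstruction_loss` and `warp_neighbor_to_ref` sketches above, and the γ and λ_w values shown are placeholders, not the tuned hyper-parameters from the paper.

```python
import torch

def ddp_total_loss(z_in, z_out, i_in, i_out, mask, neighbors, ssim_fn,
                   gamma1=0.4, gamma2=0.3, lam_w=0.85):
    """L_total = gamma1*L_disp + gamma2*L_RGB + (1 - gamma1 - gamma2)*L_warp (Eq. 2).
    `neighbors` is a list of (img_nbr, K, R, t) tuples for the N neighboring views."""
    l_disp = reconstruction_loss(z_in, z_out, mask, ssim_fn)                  # Eq. (3), masked
    l_rgb = reconstruction_loss(i_in, i_out, torch.ones_like(i_in), ssim_fn)  # Eq. (4), full image
    depth_out = 1.0 / z_out.clamp(min=1e-6)                                   # invert disparity
    l_warp = torch.zeros(())
    for img_nbr, K, R, t in neighbors:                                        # average over N views
        warped = warp_neighbor_to_ref(img_nbr, depth_out, K, R, t)
        l_warp = l_warp + lam_w * (1.0 - ssim_fn(i_in, warped)) \
                 + (1.0 - lam_w) * torch.abs(i_in - warped).mean()            # Eq. (5)
    l_warp = l_warp / max(len(neighbors), 1)
    return gamma1 * l_disp + gamma2 * l_rgb + (1.0 - gamma1 - gamma2) * l_warp
```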

The warping loss not only resolves the issue of artifacts appearing around edges within objects; we observe that it also helps to improve disparity (and depth) values in other regions. Thus, the warping term improves overall accuracy. The importance of each loss term is explored in the experiments in Section 4.

Optimization. All three losses that we use are differentiable with respect to the network parameters, and so the network is optimized using standard backpropagation.

4 Experiments

We demonstrate the effectiveness of our proposed approach on two different tasks: 1) depth completion and 2) depth refinement. We also show the generalization ability of our technique on new datasets with unseen statistical distributions. We evaluate results on four different datasets: 1) Tanks and Temples (TnT) [23], 2) the KITTI stereo benchmark [14], 3) our own collected videos, and 4) NYU depth V2 [40].

4.1 Tanks and Temples

We first evaluate the effectiveness of the presented approach on the multi-view reconstruction task. We demonstrate the qualitative and quantitative improvement on seven sequences from the Tanks and Temples dataset (TnT dataset) [24]. These sequences are Ignatius, Caterpillar, Truck, Meetingroom, Barn, Courthouse and Church.


Implementation. In this work, we applied the deep depth prior on a sequence of depth images. In all our experiments, the base network is primarily an encoder-decoder based UNet architecture [33]. The UNet encoder consists of 5 convolution blocks with 32, 64, 128, 256, and 512 channels, respectively. Each convolution operation uses 3x3 kernels. Further, we have also conducted experiments using skip networks [47]. The input noise to the network is of size m × n × 16, where m and n are the dimensions of the input depth images. Training is performed using the Adam optimizer. The initial learning rate is set to 0.00005 and is reduced by a factor of 0.01 after 12000 and 15000 epochs. The model is trained for 16000 epochs. Hyper-parameters of the loss function are set by searching on a small subset (10 images randomly chosen) from each of the Ignatius and Barn sequences in TnT.
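For reference, the stated optimizer settings translate to roughly the following PyTorch setup; the tiny stand-in network and the plain L1 objective are toy placeholders, while the learning rate, milestones, decay factor, and epoch count follow the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the 5-block UNet (32..512 channels, 3x3 kernels) described above.
net = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 4, 3, padding=1))
optimizer = torch.optim.Adam(net.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12000, 15000], gamma=0.01)

noise = torch.randn(1, 16, 64, 64)    # m x n x 16 input noise (toy spatial size)
target = torch.rand(1, 4, 64, 64)     # RGBD target (toy stand-in for the full Eq. 2 loss)
for epoch in range(16000):
    optimizer.zero_grad()
    loss = F.l1_loss(net(noise), target)
    loss.backward()
    optimizer.step()
    scheduler.step()                  # drops the learning rate by 0.01x after 12000 and 15000 epochs
```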

The depth images are fused using fusibile [11] to reconstruct the final 3D point cloud. Fusibile has hyperparameters that determine the precision and recall values for the resultant point cloud. These parameters include the disparity threshold and the number of consistent views. More details about fusibile and the loss hyper-parameters are provided in the supplementary material.

Quantitative results. Our primary comparison is with the popularly used semi-global matching (SGM) method [19] for depth image estimation. SGM is an optimization-based method that does not need any training data. We also compare with a state-of-the-art learning-based method: MVSNet [53]. Further, it should be noted that our approach is agnostic to the depth estimation method, i.e., it can be used to improve depth maps from any source.

We compare our reconstructed point clouds to the ground truth point clouds for all 7 sequences in the TnT dataset [24] using the benchmarking code included with the dataset, which returns the precision (P = TP / (TP + FP)), recall (R = TP / (TP + FN)) and f-score (F = 2·P·R / (P + R)) values for each scene given the reconstructed point cloud model and a file containing the estimated camera poses used for that reconstruction. Here, TP, FP and FN are true positives, false positives and false negatives, respectively. For each of the sequences, the benchmark also fixes the point on the precision-recall curve at which precision and recall are reported.
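For clarity, the sketch below restates these metrics from TP/FP/FN counts; it is only the formulas, not the benchmark's distance-threshold-based evaluation code.

```python
def precision_recall_fscore(tp, fp, fn):
    """Precision, recall and f-score from true positive, false positive and false negative counts."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f
```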

Ablation studies. We conduct a series of experiments on the Ignatius dataset to study the impact of each parameter and component of the proposed method. We first study the effect of the number of depth images on the final reconstruction. For this purpose, we skip a constant number of images in the dataset. For example, a skip size of 2 means that only every second image is used for reconstruction, giving us 85 images in total for Ignatius. Such a reduction in the data size increases the number of holes in the SGM-based reconstruction. The same reduced dataset is used for the baseline and our approach. We also conducted experiments with skip values of 4, 8 and 16, giving 43, 22 and 11 images. Table 1 provides details about the impact of this skip size on the relative improvement. For SGM+DDP, we keep the SGM values where there are no holes, and replace the holes with DDP output values in those regions. It can be observed that at skip sizes of 2, 4, 8 and 16, we see an improvement of 0.3, 0.16, 0.51 and 0.55 percentage


Table 1. We compare the accuracy of DDP on datasets of different sizes. As the number of images becomes smaller, the holes increase, and we get the largest relative performance gain at 11 images. Higher is better.

Dataset                 SGM     SGM+DDP
Ignatius (85 images)    44.40   44.70
Ignatius (43 images)    43.36   43.52
Ignatius (22 images)    31.66   32.17
Ignatius (11 images)    30.70   31.25

Table 2. Results comparing the DDP output using disparity, RGBD and warping losses, and using UNet or SkipNet as the network. P: Precision, R: Recall, F: F-score. Higher is better.

Network+Loss           P       R       F
UNet+D                 33.91   35.61   34.74
UNet+RGBD              36.71   36.56   36.63
UNet+RGBD+warp         39.87   48.50   43.76
SkipNet+RGBD+warp      39.94   48.26   43.71

Table 3. Quantitative results comparing 7 sequences for SGM-based depths and applying DDP on SGM depths. We combine DDP with SGM by replacing the depth values in the holes of the SGM depth with the DDP depth. Here Num img: number of images in the sequence, Consistent views: number of consistent views while constructing the point cloud, P: precision, R: recall, F: f-score. Higher is better.

Sequence       Num img   Consistent views   Disparity threshold   SGM (P / R / F)          SGM+DDP (Ours) (P / R / F)
Ignatius       85        5                  1.0                   41.83 / 47.30 / 44.40    41.21 / 48.83 / 44.70
Ignatius       22        2                  2.0                   36.37 / 28.03 / 31.66    35.66 / 29.31 / 32.17
Barn           180       2                  0.5                   23.34 / 27.81 / 25.38    22.78 / 29.26 / 25.61
Barn           45        1                  4.0                   19.06 / 21.43 / 20.18    18.10 / 22.79 / 20.18
Truck          99        4                  1.0                   35.76 / 38.80 / 37.22    34.54 / 41.36 / 37.64
Truck          25        1                  2.0                   29.44 / 33.81 / 31.48    27.69 / 36.66 / 31.55
Caterpillar    156       4                  1.0                   24.92 / 41.47 / 31.13    23.96 / 42.93 / 30.75
Caterpillar    39        1                  2.0                   17.28 / 36.54 / 23.46    16.21 / 37.89 / 22.71
Meetingroom    152       4                  1.0                   27.51 / 13.43 / 18.05    25.24 / 15.24 / 19.01
Meetingroom    38        1                  4.0                   17.80 / 9.03 / 11.98     15.68 / 10.72 / 12.73
Courthouse     110       2                  1.0                   1.78 / 0.84 / 1.14       3.20 / 1.17 / 1.71
Church         86        4                  1                     8.92 / 8.47 / 8.69       8.88 / 8.62 / 8.75

points of relative improvement in F-measure over the baseline, respectively. This suggests that at higher skip sizes, our approach provides the necessary prior information to fill holes. It should be noted that we have not included experiments with a skip size of one: running fusibile on all of the Ignatius images produces a dense reconstruction without holes, and running DDP on this dense reconstruction has no effect, so we have not included results with a skip size of one in the table. One of the goals of this work is for our accurate hole-filling to allow fewer images to be captured and used for reconstruction.

Next we show the advantage of using the RGBD and warping losses over the disparity loss only. Running the optimization for too many epochs on depth-only DDP can bring back the holes in the result, so we run the depth-only networks for 6000 epochs instead of 16000. The results are in Table 2, which also shows the benefit of the photo-consistency based warping loss. We observe a relative improvement


Table 4. We compare the reconstruction performance using depth maps generated by MVSNet and depth maps generated by applying DDP on MVSNet. P: precision, R: recall, F: f-score. Higher is better.

Dataset                  MVSNet (P / R / F)       MVSNet+DDP (P / R / F)
Ignatius (126 images)    45.89 / 52.19 / 48.84    45.16 / 54.89 / 49.55
Ignatius (32 images)     40.25 / 46.78 / 43.27    38.94 / 50.00 / 43.78

of 7.13% in F-measure after incorporation of the warping loss. Finally, we also conducted experiments with Skip-Net [47] to show the impact of using a network different from UNet. Table 2 shows that they give comparable results, and we chose to use UNet for our other experiments since it is a more commonly used network.

Comparison to prior work. A quantitative comparison with the SGM baseline on the 7 TnT sequences is shown in Table 3. The SGM+DDP (Ours) column shows the results of using DDP to improve depth maps from SGM. We combine DDP depth maps with SGM depth maps, keeping the SGM values everywhere there are no holes and replacing the holes with DDP depth values. The table includes the precision, recall and f-scores on each dataset with the sub-sampled number of images chosen using the technique described in the Ablation Studies sub-section. We observe improvement in the recall and f-score values of the presented approach over the SGM method, suggesting that the method helps in hole filling. F-score improves in 6 of the 7 sequences. Overall, SGM+DDP (Ours) improves recall by 1.49 percentage points and f-score by 0.23 percentage points. In particular, we see a significant improvement of 1.75 in recall and 0.86 in f-score on the Meetingroom sequence.

We also tested using the learning-based MVSNet method as an input to our method. While the results from MVSNet do not contain any holes, as it predicts a value at every pixel just as we do, they do have areas where the predictions have low confidence. We use its output probability map, which gives the confidence of the depth prediction, and remove depths at places with confidence below a certain threshold, to see if we can fill those areas more accurately than MVSNet did. We then fill up these holes using DDP and compare the results with the original. These results for Ignatius are in Table 4. We see a 0.61% improvement in F-score and 2.96% in recall.

Qualitative results. Next we provide visual results on the TnT dataset to highlight the impact of our approach in achieving high quality reconstruction. In Fig. 5, we show output disparity images at 500 (column c) and 16000 (column d) epochs of our proposed approach on 5 TnT sequences. Note that the holes in the input disparity images (marked as white in column b) are filled in the output images (c, d). Further, qualitative improvement can be observed across epochs when comparing visual outputs. For example, the disparity of the background wall in the second row is more consistent at 16000 epochs than at 500 epochs.


Fig. 5. (a) Input RGB and (b) input depth images for Ignatius, Meeting Room, Barn, Caterpillar and Truck, and predicted depth at epochs (c) 500 and (d) 16000. (Best viewed in digital.)

4.2 KITTI

Next we show the performance of the proposed approach on the KITTI stereo benchmark (2015) [30] for 3 different input disparity maps in Table 5. The evaluation metric used in this benchmark is D1, which measures the percentage of pixels whose predicted value differs from the ground-truth value by 3px or 5% or more. This error is measured separately for background, foreground and all regions together. The 3 approaches that generate input depths for DDP are as follows (a sketch of the D1 computation is given after the list):

1. We take the output of a state-of-the-art deep learning based stereo estimation technique, HD3 [54], as the input to our method. We apply DDP on these already complete and accurate depth maps, so we do not observe any improvement due to applying the DDP.

2. In the previous setup, the deep learning model has been trained and tested on the KITTI dataset. However, a depth (or stereo) estimation method should work in any environment. Traditional stereo methods like ELAS [15] work without any environment constraints. So, in this experiment, we compare the performance of the DDP method applied on ELAS to that of a deep learning model trained on out-of-distribution data. For the out-of-distribution model, we used the HD3 model trained on the FlyingThings3D dataset [29]. We applied DDP-based refinement and completion over depth from the ELAS method. In the first parameter setting, ELAS generates complete disparity maps without any holes, and we use DDP for refinement and completion. We observe a small performance improvement (∼0.1% reduction in D1-all error). However, compared to the HD3 model trained on the Things dataset, both ELAS and ELAS + DDP achieve an almost 71 percentage point improvement. This highlights that our approach with traditional stereo methods can work well under different environment conditions.

3. The second set of parameters for ELAS generates depth images with holes. In this case, DDP applied over the ELAS depth helps to inpaint and complete the ELAS depth. Here we observe a 4% reduction in D1-all error.
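As referenced above, a small sketch of the D1 computation is given here; it uses the standard KITTI outlier rule (a pixel is an outlier when its disparity error is at least 3 px and at least 5% of the ground truth), and the array shapes and validity mask handling are our own assumptions.

```python
import numpy as np

def d1_error(disp_pred, disp_gt, valid):
    """D1 sketch: percentage of valid ground-truth pixels whose disparity error is
    >= 3 px and >= 5% of the ground-truth disparity (the standard KITTI outlier rule)."""
    err = np.abs(disp_pred - disp_gt)
    outlier = (err >= 3.0) & (err >= 0.05 * np.abs(disp_gt))
    return 100.0 * outlier[valid].mean()
```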

4.3 Our dataset

Table 5. Results on the KITTI dataset.

Method                   D1-bg   D1-fg   D1-all
HD3                      1.8     3.6     2.1
HD3+DDP                  1.8     3.7     2.1
HD3 trained on Things    78.74   79.16   78.81
ELAS (params1)           7.5     21.1    9.8
ELAS+DDP (params1)       7.5     21.1    9.7
ELAS (params2)           20.6    28.7    22.0
ELAS+DDP (params2)       16.3    26.1    17.9

We used a cell-phone camera to capture 5 scenes, both indoor and outdoor. The datasets are monocular video sequences and are named Guitar, Shed, Stones, Red Couch, and Van.

To better understand the quality of the depth images generated by the proposed method, we warp RGB images using the original and the proposed DDP-generated depth images for Guitar, Shed, Truck, Caterpillar and Stones from TnT and our video sequences. The warped RGB images are shown in Fig. 6. Holes can be observed in the RGB images warped using the original depth maps, for example along the smooth reflective side surface of the guitar, the roof of the shed, some parts of the truck and caterpillar, and along the base of the stone wall. However, the RGB images warped using our depth images remove large portions of these holes and are far smoother.

We show reconstructed point clouds for RedCouch, Stones, Guitar and Van in Fig. 7. The first column is one of the input RGB images used for the reconstruction, the second column is the reconstruction from the original SGM output, the third is from SGM + DDP, and the fourth column is the reconstructed result from MVSNet.

The number of views used for these reconstructions is small (∼10). We chose these datasets to show specific ways in which our reconstructions improve on the original SGM and even MVSNet outputs. As we can see from our results, there are far fewer holes in the reconstructions computed from our depth maps compared to the original SGM and MVSNet. For example, from the top view of the RedCouch in the first row, we can see the relatively obscured region behind the pillow; DDP successfully fills up a big portion of this hole. Next, the base of the stone wall in the second row also gets filled. Finally, the reflective surfaces of the guitar in the third row and the van in the fourth row get completely or at least partially filled, depending on how big the original hole was. For example, note how the text "FREE" on the van is more readable in the DDP result.


4.4 NYU depth V2

Finally, we demonstrate the performance of our approach on hole-filling Kinect depth data. For this, we use data from the popular NYU V2 depth dataset [40]. Qualitative experiments on the NYU V2 depth images are shown in Fig. 8. We cannot directly use the test data from NYU, since we need at least two views as input. So we downloaded 3 video scenes and used a structure-from-motion (SfM) pipeline [46] to get the extrinsic camera pose information. Using this, we apply DDP to remove holes in the Kinect depth maps given by the dataset. We directly refine the Kinect depth images, and as the originals are incomplete, we cannot use them for performance analysis. Instead we use the depth maps to project one RGB view to another in the video sequences to compute an RGB re-projection error, which we quantify with the Peak Signal-to-Noise Ratio (PSNR). It is defined as PSNR = 20 log10(MAX_I / √MSE), where MAX_I is the maximum value of the image and MSE is the mean squared error of the image. As we can see in Fig. 8, DDP fills up holes and improves the PSNR. We also use the cross-bilateral hole-filled depth maps included in the NYU V2 dataset as a baseline. We observe consistent qualitative and quantitative improvements using DDP compared to the cross-bilateral method.
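A small sketch of this PSNR computation between an RGB view and its re-projection follows; it simply restates the formula, with the 8-bit image range as an assumption.

```python
import numpy as np

def psnr(reference, reprojected, max_val=255.0):
    """PSNR = 20 * log10(MAX_I / sqrt(MSE)); `max_val` is the assumed image range."""
    mse = np.mean((reference.astype(np.float64) - reprojected.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 20.0 * np.log10(max_val / np.sqrt(mse))
```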

5 Conclusions

We have presented an approach to reconstruct depth maps from incomplete ones. We leverage the recently proposed idea of utilizing a neural network as a prior for natural color images, and introduce three new loss terms that complete depth maps. Extensive qualitative and quantitative experiments on sequences from the Tanks and Temples, KITTI, NYU and our own dataset demonstrate the impact of the improved depth maps generated by our method. Important future extensions include improving the speed of the method and better understanding the convergence properties of DDP.

Fig. 6. Columns: Guitar, Shed, Truck, Caterpillar, Stones. (a) Original image from a neighboring viewpoint (b) Novel view synthesized from the same camera viewpoint using the original SGM depth (c) Novel view synthesized from the same camera viewpoint using DDP depth. The holes that appear in (b) get filled in (c). (Best viewed in digital.)


Fig. 7. (a) Input RGB image and reconstructed point clouds from (b) SGM, (c) DDP (Ours) and (d) MVSNet depth images for RedCouch, Stones, Guitar and Van. Our reconstructions are more complete. (Best viewed in digital.)

Fig. 8. (a) Input RGB (b) Input depth (c) Depth completed using a cross-bilateral filter and (d) Depth completed using DDP. (Best viewed in digital.)


References

1. Aanæs, H., Jensen, R.R., Vogiatzis, G., Tola, E., Dahl, A.B.: Large-scale data for multiple-view stereopsis. International Journal of Computer Vision pp. 1–16 (2016)

2. Banz, C., Blume, H., Pirsch, P.: Real-time semi-global matching disparity estimation on the GPU. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). pp. 514–521. IEEE (2011)

3. Barron, J.T., Poole, B.: The fast bilateral solver. In: European Conference on Computer Vision. pp. 617–632. Springer (2016)

4. Bleyer, M., Rhemann, C., Rother, C.: PatchMatch stereo - stereo matching with slanted support windows. In: BMVC. vol. 11, pp. 1–11 (2011)

5. Bleyer, M., Rother, C., Kohli, P., Scharstein, D., Sinha, S.: Object stereo - joint stereo matching and object segmentation. In: CVPR 2011. pp. 3081–3088. IEEE (2011)

6. Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5410–5418 (2018)

7. Chen, Z., Sun, X., Wang, L., Yu, Y., Huang, C.: A deep visual correspondence embedding model for stereo matching costs. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 972–980 (2015)

8. Cheng, Z., Gadelha, M., Maji, S., Sheldon, D.: A Bayesian perspective on the deep image prior. In: CVPR. pp. 5443–5451 (2019)

9. Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: Towards internet-scale multi-view stereo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 1434–1441. IEEE (2010)

10. Galliani, S., Lasinger, K., Schindler, K.: Massively parallel multiview stereopsis by surface normal diffusion. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 873–881 (Dec 2015). https://doi.org/10.1109/ICCV.2015.106

11. Galliani, S., Lasinger, K., Schindler, K.: Massively parallel multiview stereopsis by surface normal diffusion. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 873–881 (2015)

12. Gandelsman, Y., Shocher, A., Irani, M.: "Double-DIP": Unsupervised image decomposition via coupled deep-image-priors. In: CVPR (2019)

13. Gehrig, S.K., Eberli, F., Meyer, T.: A real-time low-power stereo vision engine using semi-global matching. In: International Conference on Computer Vision Systems. pp. 134–143. Springer (2009)

14. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11), 1231–1237 (2013)

15. Geiger, A., Roser, M., Urtasun, R.: Efficient large-scale stereo matching. In: Asian Conference on Computer Vision. pp. 25–38. Springer (2010)

16. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multi-view stereo for community photo collections. In: 2007 IEEE 11th International Conference on Computer Vision. pp. 1–8. IEEE (2007)

17. Haala, N., Rothermel, M.: Dense multi-stereo matching for high quality digital elevation models. Photogrammetrie-Fernerkundung-Geoinformation 2012(4), 331–343 (2012)

18. Heise, P., Klose, S., Jensen, B., Knoll, A.: PM-Huber: PatchMatch with Huber regularization for stereo matching. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2360–2367 (2013)

19. Hirschmuller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2), 328–341 (2007)


20. Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: DeepMVS: Learning multi-view stereopsis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2821–2830 (2018)

21. Ji, M., Gall, J., Zheng, H., Liu, Y., Fang, L.: SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2307–2315 (2017)

22. Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 66–75 (2017)

23. Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and Temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph. 36(4), 78:1–78:13 (Jul 2017). https://doi.org/10.1145/3072959.3073599

24. Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and Temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36(4), 78 (2017)

25. Knobelreiter, P., Pock, T.: Learned collaborative stereo refinement. In: German Conference on Pattern Recognition. pp. 3–17. Springer (2019)

26. Knobelreiter, P., Reinbacher, C., Shekhovtsov, A., Pock, T.: End-to-end training of hybrid CNN-CRF models for stereo. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2339–2348 (2017)

27. Lu, J., Yang, H., Min, D., Do, M.N.: Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1854–1861 (2013)

28. Luo, W., Schwing, A.G., Urtasun, R.: Efficient deep learning for stereo matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5695–5703 (2016)

29. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4040–4048 (2016)

30. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)

31. Pang, J., Sun, W., Ren, J.S., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 887–895 (2017)

32. Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., Zha, H.: Unsupervised deep learning for optical flow estimation. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)

33. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

34. Scharstein, D., Hirschmuller, H., Kitajima, Y., Krathwohl, G., Nesic, N., Wang, X., Westling, P.: High-resolution stereo datasets with subpixel-accurate ground truth. In: German Conference on Pattern Recognition. pp. 31–42. Springer (2014)

35. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vision 47(1-3), 7–42 (Apr 2002). https://doi.org/10.1023/A:1014573219977


36. Scharstein, D., Taniai, T., Sinha, S.N.: Semi-global stereo matching with surface orientation priors. 2017 International Conference on 3D Vision (3DV) pp. 215–224 (2017)

37. Schonberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV) (2016)

38. Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3260–3269 (2017)

39. Seki, A., Pollefeys, M.: SGM-Nets: Semi-global matching with neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 231–240 (2017)

40. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V. Lecture Notes in Computer Science, vol. 7576, pp. 746–760. Springer (2012)

41. Sinha, S.N., Scharstein, D., Szeliski, R.: Efficient high-resolution stereo matching using local plane sweeps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1582–1589 (2014)

42. Taniai, T., Maehara, T.: Neural inverse rendering for general reflectance photometric stereo. In: ICML (2018)

43. Taniai, T., Matsushita, Y., Naemura, T.: Graph cut based continuous stereo matching using locally shared labels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1613–1620 (2014)

44. Taniai, T., Matsushita, Y., Sato, Y., Naemura, T.: Continuous 3D label stereo matching using local expansion moves. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(11), 2725–2739 (2017)

45. Tonioni, A., Poggi, M., Mattoccia, S., Di Stefano, L.: Unsupervised adaptation for deep stereo. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1605–1613 (2017)

46. Ullman, S.: The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences 203(1153), 405–426 (1979)

47. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. In: CVPR. pp. 9446–9454 (2018)

48. Voynov, O., Artemov, A., Egiazarian, V., Notchenko, A., Bobrovskikh, G., Burnaev, E., Zorin, D.: Perceptual deep depth super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5653–5663 (2019)

49. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

50. Williams, F., Schneider, T., Silva, C., Zorin, D., Bruna, J., Panozzo, D.: Deep geometric prior for surface reconstruction. In: CVPR. pp. 10130–10139 (2019)

51. Woodford, O., Torr, P., Reid, I., Fitzgibbon, A.: Global stereo reconstruction under second-order smoothness priors. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(12), 2115–2128 (2009)

52. Xue, Y., Chen, J., Wan, W., Huang, Y., Yu, C., Li, T., Bao, J.: MVSCRF: Learning multi-view stereo with conditional random fields. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4312–4321 (2019)


53. Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: MVSNet: Depth inference for unstructured multi-view stereo. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 767–783 (2018)

54. Yin, Z., Darrell, T., Yu, F.: Hierarchical discrete distribution decomposition for match density estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6044–6053 (2019)

55. Zbontar, J., LeCun, Y., et al.: Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17(1-32), 2 (2016)

56. Zhou, C., Zhang, H., Shen, X., Jia, J.: Unsupervised learning of stereo matching. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 1576–1584 (Oct 2017). https://doi.org/10.1109/ICCV.2017.174