Using CNNs to Estimate Depth from Stereo•Depth...

Post on 05-Jul-2020


• Depth estimation is an important problem in the fields of 3DTV, virtual reality, and autonomous vehicles

• Conventional image processing algorithms don’t produce satisfactory results for complex, real-world scenes

• We explore the use of CNNs to tackle this problem

Using CNNs to Estimate Depth from Stereo Imagery

Tyler Jordan {tsjordan}, Skanda Shridhar {sshridha}, Jayant Thatte {}

Department of Electrical Engineering, Stanford University

Motivation

Postprocessing [1, 4]

Experimental Results


Pipeline stages (figure labels): Matching Cost, Cross-Based Cost Aggregation, Energy Constraints, Occlusion Interpolation, Disparity Map

Major objects in the scenes, such as the road, signs, and cars, are rendered accurately in the disparity maps. The left and right edges are not as clean as the center of the image because less redundant data is available there. The CNN approach performs far better than the naïve plane-sweep approach.

9x9 patches from left and right images
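As an illustration of how 9x9 patch pairs could be cut from a rectified stereo pair, here is a minimal sketch. The helper names `extract_patch` and `patch_pair`, the integer `disparity`, and the `offset` parameter used to produce a non-matching pair are all assumptions made for this example; the poster does not specify how the pairs were sampled.

```python
def extract_patch(img, y, x, size=9):
    """Return the size x size patch centered at (y, x) from a 2D list image."""
    r = size // 2
    h, w = len(img), len(img[0])
    if y - r < 0 or y + r >= h or x - r < 0 or x + r >= w:
        raise ValueError("patch would fall outside the image")
    return [row[x - r:x + r + 1] for row in img[y - r:y + r + 1]]


def patch_pair(left, right, y, x, disparity, offset=0, size=9):
    """Pair a left patch at (y, x) with a right patch at (y, x - disparity + offset).

    offset == 0 gives a matching pair (assuming the disparity is correct);
    a nonzero offset (hypothetical parameter) gives a non-matching pair.
    """
    left_patch = extract_patch(left, y, x, size)
    right_patch = extract_patch(right, y, x - disparity + offset, size)
    return left_patch, right_patch
```

With a synthetic right image that is the left image shifted by the true disparity, the matching pair is identical and a shifted pair is not.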

Support region (red) created by the union of horizontal crosses along the vertical cross. The cross lengths are determined by intensity difference and length constraints. This allows for context-based blurring.
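The support-region construction described in this caption can be sketched as follows. This is a simplified reading of the cross-based scheme in [1], not the authors' code: the function names, the grayscale 2D-list image format, and the threshold values `tau` and `max_len` are assumptions chosen for illustration.

```python
def arm_length(img, y, x, dy, dx, tau, max_len):
    """Count steps from (y, x) along (dy, dx) whose intensity stays within tau
    of the anchor pixel, up to max_len steps and the image border."""
    h, w = len(img), len(img[0])
    n = 0
    while n < max_len:
        yy, xx = y + (n + 1) * dy, x + (n + 1) * dx
        if not (0 <= yy < h and 0 <= xx < w):
            break
        if abs(img[yy][xx] - img[y][x]) > tau:
            break
        n += 1
    return n


def support_region(img, y, x, tau=20, max_len=17):
    """Union of the horizontal arms of every pixel on the vertical arm of (y, x)."""
    up = arm_length(img, y, x, -1, 0, tau, max_len)
    down = arm_length(img, y, x, 1, 0, tau, max_len)
    region = set()
    for yy in range(y - up, y + down + 1):
        left = arm_length(img, yy, x, 0, -1, tau, max_len)
        right = arm_length(img, yy, x, 0, 1, tau, max_len)
        for xx in range(x - left, x + right + 1):
            region.add((yy, xx))
    return region


def aggregate_cost(cost, region):
    """Average a per-pixel matching cost over the support region."""
    return sum(cost[y][x] for y, x in region) / len(region)
```

Because the arms stop at large intensity changes, the averaging stays inside regions of similar appearance, which is one way to read the "context-based blurring" mentioned in the caption.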

Regions occluded in the left image (blue) are filled in with data from the right (red)

Figure panels: Cross-Based Cost Aggregation; Occlusion Interpolation





Holes in the depth map are filled with interpolation using bidirectional matching
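One common way to find such holes is a left-right consistency check, sketched below. Whether this matches the poster's "bidirectional matching" exactly is an assumption; the function name, the integer disparities, the convention that left pixel x corresponds to right pixel x - d, and the tolerance `tol` are all illustrative choices.

```python
def left_right_mismatch(disp_left, disp_right, tol=1):
    """Flag left-image pixels whose disparity disagrees with the right map.

    A left pixel at column x with disparity d is compared against the right
    disparity at column x - d. Pixels that map outside the image, or that
    differ by more than tol, are marked True (unreliable).
    """
    h, w = len(disp_left), len(disp_left[0])
    bad = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = disp_left[y][x]
            xr = x - d
            if xr < 0 or xr >= w or abs(disp_right[y][xr] - d) > tol:
                bad[y][x] = True
    return bad
```

The resulting mask marks the pixels that a later interpolation step would need to fill.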

Regions where the right and left depth maps don’t agree after occlusion interpolation are filled by the median of the closest good pixels in 16 directions
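The 16-direction median fill can be sketched as below. This is an illustrative version under stated assumptions: the `bad` mask is taken as given, rays are sampled at unit steps with rounding to the nearest pixel, and the function name is hypothetical.

```python
import math
from statistics import median


def fill_from_16_directions(disp, bad):
    """Replace each flagged pixel with the median of the nearest unflagged
    disparity found along each of 16 evenly spaced directions."""
    h, w = len(disp), len(disp[0])
    out = [row[:] for row in disp]
    dirs = [(math.sin(2 * math.pi * k / 16), math.cos(2 * math.pi * k / 16))
            for k in range(16)]
    for y in range(h):
        for x in range(w):
            if not bad[y][x]:
                continue
            found = []
            for dy, dx in dirs:
                step = 1
                while True:
                    yy = int(round(y + step * dy))
                    xx = int(round(x + step * dx))
                    if not (0 <= yy < h and 0 <= xx < w):
                        break  # ray left the image without finding a good pixel
                    if not bad[yy][xx]:
                        found.append(disp[yy][xx])
                        break
                    step += 1
            if found:
                out[y][x] = median(found)
    return out
```

Values are always read from the original map rather than from already-filled pixels, so the result does not depend on scan order.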

References

[1] Zhang et al. "Cross-based local stereo matching using orthogonal integral images." Circuits and Systems for Video Technology, IEEE Transactions on 19.7 (2009).

[2] Kim et al. "3D scene reconstruction from multiple spherical stereo pairs." International Journal of Computer Vision 104.1 (2013).

Convolutional Neural Network

L1 Filters

• Inputs: 1 million matching & non-matching image patches are fed into the network

• Output: Stereo matching cost, per pixel

• Training: The 1st layer is convolutional; all remaining layers are fully connected

• Testing: All fully connected layers are expressed as conv. layers so that the entire test image can be processed at once

• Platform: Caffe w/ CuDNN GPU acceleration
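The testing trick in the list above rests on a simple equivalence: a fully connected unit applied to a flattened k x k patch computes the same value as sliding its weights over the image as a k x k filter. The sketch below checks this for a single unit in plain Python; it does not reproduce the actual network or its Caffe definition, and the function names are hypothetical.

```python
def fc_on_patch(patch, weights, bias):
    """Fully connected unit: dot product of the flattened patch with weights."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(image, weights, bias, k):
    """Apply the same weights as a k x k filter at every valid image position."""
    h, w = len(image), len(image[0])
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            acc = bias
            for r in range(k):
                for c in range(k):
                    acc += kernel[r][c] * image[y + r][x + c]
            row.append(acc)
        out.append(row)
    return out
```

Each output of `fc_as_conv` equals `fc_on_patch` on the patch at that position, so one pass over the whole image replaces a separate forward pass per patch.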

[3] Kim et al. "Dynamic 3D scene reconstruction in outdoor environments." In Proc. IEEE Symp. on 3D Data Processing and Visualization. 2010.

[4] Žbontar et al. "Computing the stereo matching cost with a convolutional neural network." arXiv preprint arXiv:1409.4326 (2014).