![Page 1: Unsupervised Learning of Depth and Ego-Motion from Videocseweb.ucsd.edu/~mkchandraker/classes/CSE291/Winter2019/Lectures/01_SFMLearner.pdfUnsupervised Learning of Depth and Ego-Motion](https://reader033.fdocuments.us/reader033/viewer/2022042005/5e6fabd876dc3c268a2cd023/html5/thumbnails/1.jpg)
Unsupervised Learning of Depth and Ego-Motion from Video
Presenter: Zhang Menghe
Jan. 18th, 2019
Tinghui Zhou¹, Matthew Brown², Noah Snavely², David G. Lowe²
UC Berkeley¹, Google²
Background
► Humans can easily perceive 3D from 2D
Image from the Cityscapes dataset, annotated over the slide build with the labels "Close", "Far", "Farther", "Won't hit!!", and "Cuboidal".
Background
► Projection kills the 3rd Dimension
A specific object shape in the 2D plane could be caused by multiple different 3D objects
[Sinha & Adelson, 1993]
Background
► Mimic humans' approach
Learn 3D from a large number of 2D views without any ground-truth 3D labels.
Training: multi-view
Testing: single-view
Related Work
Structure from Motion
► Estimates 3D structure from 2D image sequences that may be coupled with local motion signals.
► Relies on accurate image correspondence
▪ Performs poorly with low texture, complex geometry/photometry, thin structures, and occlusions
Figure: Structure from Motion from multiple views. Image from MathWorks.
Related Work
Warping-based View Synthesis
► Synthesize the appearance of the scene as seen from novel camera viewpoints.
► DeepStereo
▪ End-to-end learning: constructs the new view by transforming the input images based on depth or flow.
▪ The underlying geometry is represented by quantized depth planes.
► Warping-based methods:
▪ Estimate the underlying 3D geometry explicitly, or establish pixel correspondence among the input views.
▪ Synthesize the novel views by compositing image patches from the input views.
▪ Are forced to learn intermediate predictions of geometry and/or correspondence.
Figure: render a new view at C from existing images at $V_1$ and $V_2$. [DeepStereo] Flynn et al., 2015
Related Work
Single Image 3D Prediction
Timeline:
1963: "Blocks World", Larry Roberts
2005: [Photo Pop-up] Hoiem et al.; [Make3D] Saxena et al.
2014: [Multi-Scale DN] Eigen et al.
2017: [Semi-Supervised Deep Learning] Kuznietsov et al.
Related Work
Learning 3D from Posed 2D Views
A stereopsis-based auto-encoder setup:
Part 1 is an encoder CNN that maps an image to a depth map.
Part 2 is a decoder that synthesizes a backward-warped image.
Part 3 is a simple loss that matches the reconstructed output to the encoder input.
[Unsupervised CNN for Single View Depth Estimation] Garg et al., 2016
Main Idea
► Get intermediate predictions (depth) and camera poses
► Build the ENTIRE view synthesis pipeline
▪ The pipeline is formulated as the inference procedure of a convolutional neural network.
▪ The network is forced to learn the intermediate tasks of depth and camera pose estimation.
Goal: build a geometric view synthesis system that performs consistently well.
Main Idea
► Jointly train a single-view depth CNN and a camera pose estimation CNN from unlabeled video sequences.
► Jointly train, independently use.
► Totally unsupervised.
Figure: joint training framework. Unlabeled video clips are the input; the outputs are single-view depth and relative pose.
Approaches: View Synthesis
► Learn from video clips
► Geometry-based view synthesis: if we know the 3D model and the camera viewpoints of the video frames, we can synthesize the frames by projection.
Figure: frames $t-1$, $t$, $t+1$ of a video clip, related by projection. Image from https://kushalvyas.github.io/stitching.html
Approaches: View Synthesis as Supervision
► Use this task as supervision.
► Learn both 3D and pose estimation.
► 3D representation: depth map, voxels, layers
Figure labels: CNN, projection, depth map, and the relative poses $T_{t-1,t}$ and $T_{t,t+1}$. Image from https://kushalvyas.github.io/stitching.html
Approaches: View Synthesis as Supervision
► Depth CNN:
▪ Input: single frame at time $t$, $I_t$.
▪ Output: per-pixel depth map $\hat{D}_t$.
► Pose CNN:
▪ Input: target view ($I_t$) and the nearby/source views ($I_{t-1}$, $I_{t+1}$).
▪ Output: relative camera poses ($\hat{T}_{t \to t-1}$, $\hat{T}_{t \to t+1}$).
Figure: overview of the supervision pipeline based on view synthesis.
Approaches: View Synthesis as Supervision
Photometric error as objective:
$$L_{vs} = \sum_{s \in \{\text{nearby frames}\}} \sum_{p} \left| I_t(p) - \hat{I}_s(p) \right|$$
► Parameters
▪ Input: $\langle I_1, \dots, I_n \rangle$ as training frames, where $I_t$ is the target view and the others are source views $I_s$.
▪ $\hat{I}_s$ is the source view $I_s$ warped to the target coordinate frame (discussed later).
▪ $p$ indexes over pixel coordinates.
How to solve?
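Approaches: Differentiable depth image rendering
► Reconstruct the target view $I_t$ by sampling pixels from a source view $I_s$.
► Parameters
▪ $K$: camera intrinsic matrix
▪ $\hat{D}_t$: predicted depth map
▪ $\hat{T}_{t \to s}$: predicted relative camera pose from the target view to the source view
▪ $p_t$, $p_s$: homogeneous pixel coordinates in the target and source views
Figure: illustration of the differentiable image warping process.
$$p_s \sim K \, \hat{T}_{t \to s} \, \hat{D}_t(p_t) \, K^{-1} p_t$$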
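The projection above can be traced for a single pixel with a short NumPy sketch. It is only an illustration under stated assumptions: the pose is taken to be a 4x4 rigid transform, and the function name, intrinsics, and pose values are hypothetical rather than taken from the slides.

```python
import numpy as np

def project_target_to_source(p_t, depth, K, T_t_to_s):
    """Map a target pixel (u, v) with a known depth to source pixel coordinates."""
    p_h = np.array([p_t[0], p_t[1], 1.0])          # homogeneous pixel coordinates
    # Back-project: D_t(p_t) * K^{-1} p_t is the 3D point in the target camera frame.
    X_t = depth * (np.linalg.inv(K) @ p_h)
    # Move the point into the source camera frame (assumed 4x4 rigid transform).
    X_s = T_t_to_s @ np.append(X_t, 1.0)
    # Project with the intrinsics and divide by depth to get pixel coordinates.
    p_s = K @ X_s[:3]
    return p_s[:2] / p_s[2]

# Hypothetical intrinsics: focal length 100 px, principal point (64, 48).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0, 0.0, 1.0]])

identity_pose = np.eye(4)
shifted_pose = np.eye(4)
shifted_pose[0, 3] = 0.5                            # hypothetical translation along x

same_pixel = project_target_to_source((70.0, 50.0), 10.0, K, identity_pose)
moved_pixel = project_target_to_source((70.0, 50.0), 10.0, K, shifted_pose)
```

With the identity pose the pixel maps to itself; with the translated pose it shifts horizontally by an amount that shrinks as depth grows, which is the signal the depth network can learn from.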
Approaches: Differentiable Pixel Sampling
► Differentiable bilinear sampling mechanism to sample the continuous $p_s$.
Figure: illustration of the differentiable image warping process.
$$\hat{I}_s(p_t) = \mathrm{BilinearInterpolation}\; I_s\left(p_s^{tl},\, p_s^{bl},\, p_s^{br},\, p_s^{tr}\right)$$
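A minimal sketch of bilinear sampling at a continuous location, written in NumPy for a single-channel image. Clamping at the image border is an assumption made here for simplicity, and the function name is hypothetical; in a training setup the same arithmetic would run inside an autodiff framework so that gradients flow to the sampling coordinates.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Sample a 2D image at continuous (x, y); x is the column, y is the row."""
    h, w = image.shape
    # Assumption: coordinates outside the image are clamped to the border.
    x = float(np.clip(x, 0.0, w - 1.0))
    y = float(np.clip(y, 0.0, h - 1.0))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Blend the four neighbours (top-left, top-right, bottom-left, bottom-right).
    top = (1.0 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1.0 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1.0 - wy) * top + wy * bottom

source = np.array([[0.0, 10.0],
                   [20.0, 30.0]])
center_value = bilinear_sample(source, 0.5, 0.5)   # equal blend of all four pixels
corner_value = bilinear_sample(source, 1.0, 0.0)   # lands exactly on the top-right pixel
```

Because the output is a weighted sum whose weights depend linearly on the fractional offsets, it is differentiable with respect to the sampling location almost everywhere.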
Approaches: Modeling the model limitation
► An explainability prediction network outputs a per-pixel soft mask $\hat{E}_s$:
$$L_{vs} = \sum_{\langle I_1, \dots, I_n \rangle \in \mathcal{S}} \sum_{p} \hat{E}_s(p) \left| I_t(p) - \hat{I}_s(p) \right|$$
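As a rough illustration of how the soft mask reweights the photometric error, the sketch below computes the loss for one training sequence with and without masks. The function name and the toy arrays are hypothetical, and the source views are assumed to be already warped into the target frame.

```python
import numpy as np

def view_synthesis_loss(target, warped_sources, masks=None):
    """Sum over source views and pixels of E_s(p) * |I_t(p) - I_hat_s(p)|."""
    total = 0.0
    for i, warped in enumerate(warped_sources):
        error = np.abs(target - warped)
        if masks is not None:
            error = masks[i] * error               # down-weight pixels the model cannot explain
        total += float(error.sum())
    return total

target = np.array([[1.0, 2.0], [3.0, 4.0]])
warped = [np.array([[1.0, 2.0], [3.0, 8.0]])]      # one pixel is off by 4
mask = [np.array([[1.0, 1.0], [1.0, 0.25]])]       # low weight on that pixel

plain_loss = view_synthesis_loss(target, warped)
masked_loss = view_synthesis_loss(target, warped, mask)
```

Passing `masks=None` recovers the unmasked photometric objective, so the masked version is a strict generalization of it.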
Approaches: Overcoming Gradient Locality
► Multi-scale and smoothness loss.
► Allow gradients to be derived from larger spatial regions directly.
Final version:
$$L_{final} = \sum_{l} L_{vs}^{l} + \lambda_s L_{smooth}^{l} + \lambda_e \sum_{s} L_{reg}\left(\hat{E}_s^{l}\right)$$
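The sketch below shows one way the terms of the final objective could be combined across scales. Several details are assumptions not stated on the slide: the smoothness term is taken as the L1 norm of second-order differences of the depth map, the mask regularizer as a cross-entropy against a constant label of 1, all three terms are placed inside the sum over scales, and the weights `lambda_s` and `lambda_e` are placeholder values.

```python
import numpy as np

def smoothness_loss(depth):
    """Assumed form: L1 norm of second-order finite differences of the depth map."""
    d2x = depth[:, 2:] - 2.0 * depth[:, 1:-1] + depth[:, :-2]
    d2y = depth[2:, :] - 2.0 * depth[1:-1, :] + depth[:-2, :]
    return float(np.abs(d2x).sum() + np.abs(d2y).sum())

def mask_regularization(mask, eps=1e-8):
    """Assumed form: cross-entropy with constant label 1, i.e. -log(E) per pixel."""
    return float(-np.log(mask + eps).sum())

def final_loss(scales, lambda_s=0.5, lambda_e=0.2):
    """scales: one dict per scale with keys 'vs' (float), 'depth', and 'masks'."""
    total = 0.0
    for scale in scales:
        reg = sum(mask_regularization(m) for m in scale["masks"])
        total += scale["vs"] + lambda_s * smoothness_loss(scale["depth"]) + lambda_e * reg
    return total

ramp_depth = np.tile(np.arange(4.0), (4, 1))       # planar depth: zero second derivative
spiky_depth = ramp_depth.copy()
spiky_depth[2, 2] += 5.0                           # a single depth outlier

scale = {"vs": 2.0, "depth": ramp_depth, "masks": [np.ones((4, 4))]}
loss_value = final_loss([scale])
```

The regularizer is what keeps the masks from collapsing to zero, which would otherwise trivially minimize the masked photometric term.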
Approaches: Network Architecture
► Depth network = DispNet + multi-scale side predictions
▪ DispNet: kernel size 3 for all layers except the first four conv layers, which use kernel sizes 7, 7, 5, 5, respectively.
Approaches: Network Architecture
► Pose and explainability networks:
▪ Share the first 5 feature-encoding layers.
▪ Then branch out to predict the 6-DOF relative pose and multi-scale explainability masks.
▪ Kernel size 3 for all layers except the first 2 and last 2 layers, which use 7, 5, 5, 7, respectively.
Experiment: Datasets
► Cityscapes
▪ Large dataset with semantic, instance-wise, dense pixel annotations of 30 classes
▪ 5,000 images with high-quality annotations, 20,000 images with coarse annotations, 50 different cities
► KITTI
▪ Smaller
► Make3D
▪ Range image data
Experiment: Results for depth map (KITTI)
Figure panels: Input, Ground truth, Depth-supervised, Pose-supervised, Unsupervised.
► Compared with supervised training results: comparable, without using any ground-truth depth or pose labels.
Experiment: Results for depth map
Single-view depth results on the KITTI dataset
Experiment: Results for depth map (KITTI fine-tune)
► Comparison of single-view depth predictions on the KITTI dataset by the initial Cityscapes model and the final model (pre-trained on Cityscapes and then fine-tuned on KITTI).
Experiment: Results for depth map (Make3D)
► Evaluates cross-dataset generalization ability.
► Make3D was not seen during training.
► The model is still able to capture the global scene layout reasonably well.
Experiment: Results for Pose Estimation
► ORB-SLAM (full): recovers odometry using all frames of the driving sequence (3 times more data).
► ORB-SLAM (short): runs on 5-frame snippets.
► When side-rotation is small, this network outperforms ORB-SLAM (short) and performs comparably to ORB-SLAM (full).
Conclusion
► An end-to-end unsupervised learning pipeline.
► Geometric consistency for learning 3D from unlabeled videos.
► "Meta-" supervision: supervise how data behave.
Future Work
► Explicitly estimate scene dynamics and occlusions.
■ Direct modeling of scene dynamics.
► Address the case where the camera intrinsics are unknown.
► Use a richer 3D scene representation than a depth map.
► Investigate in more detail the representation learned by this system.
Thanks