Burnaev and Notchenko. Skoltech. Bridging gap between 2D and 3D with Deep Learning
Bridging the gap between 2D and 3D with Deep Learning
Evgeny Burnaev (PhD) <[email protected]>, assoc. prof., Skoltech
Alexandr Notchenko <[email protected]>, PhD student
[1]
ImageNet top-5 error over the years (figure: deep-learning-based methods vs. feature-based methods vs. human performance)
Supervised Deep Learning: data type and supervision
● 2D image classification, detection, segmentation: class label, object detection box, segmentation contours
● Pose estimation: structure of a "skeleton" on the image
But the world is in 3D
3D deep learning is gaining popularity
Workshops:
● Deep Learning for Robotic Vision Workshop, CVPR 2017
● Geometry Meets Deep Learning, ECCV 2016
● 3D Deep Learning Workshop @ NIPS 2016
● Large Scale 3D Data: Acquisition, Modelling and Analysis, CVPR 2016
● 3D from a Single Image, CVPR 2015
A Google Scholar search for "3D" "Deep Learning" returns:
Year   # articles
2012   410
2013   627
2014   1210
2015   2570
2016   5440
Representations of 3D data for deep learning
● Many 2D projections. Pros: preserves surface texture; many 2D DL methods exist. Cons: redundant representation; vulnerable to optical illusions.
● Voxels. Pros: simple; can be sparse; has volumetric properties. Cons: loses surface properties.
● Point cloud. Pros: can be sparse. Cons: loses both surface and volumetric properties.
● 2.5D images. Pros: cheap measurement devices; senses depth. Cons: self-occlusion of bodies in a scene; a lot of noise in measurements.
[6]
[2]
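As a sketch of the voxel representation discussed above, a point cloud can be binned into a binary occupancy grid. The function name and resolution below are illustrative, not the authors' code; real pipelines typically use library voxelizers.

```python
import numpy as np

def voxelize(points, resolution=30):
    """Convert an (N, 3) point cloud to a binary occupancy grid.

    Illustrative helper: bins each point into one of
    resolution^3 cells after rescaling the cloud's bounding box.
    """
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    # Scale coordinates into [0, resolution - 1] along each axis.
    scale = (resolution - 1) / np.maximum(hi - lo, 1e-9)
    idx = np.floor((points - lo) * scale).astype(int)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# A cloud of 1000 random points occupies at most 1000 of 27,000 cells,
# which is why sparse storage pays off at higher resolutions.
cloud = np.random.rand(1000, 3)
grid = voxelize(cloud, 30)
```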
3D shape as a dense point cloud
Latest developments in the SLAM family of methods
LSD-SLAM (Large-Scale Direct Monocular Simultaneous Localization and Mapping)
[5] LSD-SLAM: direct (feature-less) monocular SLAM
ElasticFusion
ElasticFusion: dense SLAM without a pose graph [7]
DynamicFusion
The technique won the prestigious CVPR 2015 best paper award. [9]
Problems of SLAM algorithms
● Don't represent objects (only know surfaces)
● Mostly dense representations (require a lot of data)
● The whole scene is one big surface, e.g. different objects that are close to each other cannot be separated.
3D Shape Retrieval
3D Design Phase
● There exist massive repositories of 3D CAD models, e.g. GrabCAD (chairs, mechanical parts)
● Designers spend about 60% of their time searching for the right information
● Massive and complex CAD models are usually disorderly archived in enterprises, which makes design reuse a difficult task
● 3D model retrieval can significantly shorten product lifecycles
3D Shape-based Model Retrieval
● 3D models are complex, so there are no clear search rules
● Text-based search has its limitations: e.g. 3D models are often poorly annotated
● There is some commercial software for 3D CAD model search, e.g. Exalead OnePart by Dassault Systèmes, Geolus Search by Siemens PLM, and others
● However, the methods used are time-consuming, are often based on hand-crafted descriptors, can be limited to a specific class of shapes, and are not robust to scaling, rotations, etc.
Sparse 3D Convolutional Neural Networks for Large-Scale Shape Retrieval
Alexandr Notchenko, Ermek Kapushev, Evgeny Burnaev
Presented at 3D Deep Learning Workshop at NIPS 2016
Sparsity of the voxel representation
● A 30^3 voxel grid is already enough to understand a simple shape
● With texture information it would be even easier
● Averaged over all classes of the ModelNet40 training set, occupancy at voxel resolution 40 is only 5.5%
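The occupancy figure above can be illustrated with a toy computation. The shape below (a thin spherical shell as a crude stand-in for an object surface) and the resulting number are synthetic, not the ModelNet40 statistic:

```python
import numpy as np

# Sparsity = fraction of occupied cells in a binary voxel grid.
grid = np.zeros((40, 40, 40), dtype=bool)

# Occupy a thin spherical shell: surfaces touch few volume cells,
# which is what makes voxel grids of real shapes so sparse.
x, y, z = np.indices(grid.shape)
r = np.sqrt((x - 20) ** 2 + (y - 20) ** 2 + (z - 20) ** 2)
grid[(r > 14) & (r < 16)] = True

sparsity = grid.sum() / grid.size  # small fraction of 64,000 cells
```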
Shape Retrieval
A Sparse 3D CNN maps each shape to a feature vector (e.g. V_plane for a plane). Feature vectors for the whole dataset (V_car, V_person, ...) are precomputed; given a query, items are retrieved by cosine distance between the query's feature vector and the precomputed ones.
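The retrieval step can be sketched as follows. The embeddings here are random placeholders standing in for precomputed CNN features; the function name and dimensions are illustrative:

```python
import numpy as np

def cosine_retrieve(query_vec, db_vecs, k=3):
    """Rank database feature vectors by cosine similarity to the query.

    db_vecs: (N, D) array of precomputed shape embeddings
    (the V_car, V_person, ... of the pipeline).
    Returns the indices of the top-k most similar items.
    """
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                 # cosine similarity per item
    return np.argsort(-sims)[:k]  # indices of top-k matches

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 64))                # 100 fake embeddings
query = db[17] + 0.01 * rng.normal(size=64)    # near-duplicate of item 17
top = cosine_retrieve(query, db)
# top[0] == 17: the near-duplicate ranks first
```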
Triplet loss
The representation can be efficiently learned by minimizing triplet loss.
A triplet is a set (a, p, n), where
● a is the anchor object
● p is a positive object, similar to the anchor
● n is a negative object, not similar to the anchor
The loss is L(a, p, n) = max(0, d(a, p) - d(a, n) + alpha), where alpha is a margin parameter and d(a, p), d(a, n) are the distances between p and a and between n and a.
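A minimal sketch of this loss, assuming Euclidean distance and a margin of 0.2 (the paper's exact distance and margin may differ):

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    """L(a, p, n) = max(0, d(a, p) - d(a, n) + margin), Euclidean d."""
    d_ap = np.linalg.norm(a - p)  # anchor-to-positive distance
    d_an = np.linalg.norm(a - n)  # anchor-to-negative distance
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([1.0, 0.0])   # far from the anchor
loss = triplet_loss(a, p, n)  # 0.1 - 1.0 + 0.2 < 0, so loss is 0
```

When the negative is already farther than the positive by more than the margin, the loss is zero and the triplet contributes no gradient; swapping p and n makes the loss positive and pushes the embedding apart.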
Our approach
● Use very large resolutions and sparse representations.
● Use triplet learning for 3D shapes.
● Use the large-scale shape datasets ModelNet and ShapeNet.
Represent a voxel shape as a vector
Obligatory t-SNE visualization of the learned shape embeddings (figure)
Conclusions
● For small shape datasets, voxels (as sparse 3D tensors) can work.
● Voxels don't scale to hundreds of classes and lose texture information.
● They cannot encode complicated object domains.
Problems for next 5 years
Autonomous Vehicles
Augmented (Mixed) Reality
Robotics in human environments
Robotic Control in Human Environments
Commodity sensors to create 2.5D images
Intel RealSense Series
Asus Xtion Pro
Microsoft Kinect v2
Structure Sensor
What do they have in common?
They require understanding the whole scene.
Problem of “Holistic” Scene understanding
Lin, D., Fidler, S., & Urtasun, R. "Holistic scene understanding for 3D object detection with RGBD cameras." Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1417-1424.
● Human environments are often designed by humans
● Most of the objects are created by humans
● Context provides information via joint probability functions
● Textures are caused by materials and can therefore explain the function and structure of an object
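The "context provides information" point can be illustrated with a toy Bayesian update: an ambiguous detector score combined with a scene-context prior. All numbers below are invented for illustration:

```python
# P(object | score, scene) is proportional to
# P(score | object) * P(object | scene)  (Bayes' rule over two hypotheses).
prior_kitchen = {"mug": 0.30, "bottle": 0.10}  # scene-context prior
likelihood = {"mug": 0.5, "bottle": 0.6}       # ambiguous detector scores

def posterior(priors, likes):
    """Normalize prior * likelihood over the competing hypotheses."""
    unnorm = {k: priors[k] * likes[k] for k in priors}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

post = posterior(prior_kitchen, likelihood)
# Context tips the decision toward "mug" despite the weaker likelihood.
```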
Connecting 3 families of CV algorithms is inevitable
Learnable Computer Vision Systems(Deep Learning)
Geometric Computer Vision (SLAMs)
Probabilistic Computer Vision
(Bayesian methods)
At the intersection of the three: Probabilistic Inverse Graphics
Probabilistic Inverse Graphics enables:
● Taking setting information into account (shop: shelves and products; street: buildings, cars, pedestrians)
● Making maximum likelihood estimates from data and a model (or giving directions on how best to reduce uncertainty)
● Learning the structure of objects (materials and textures / 3D shape / intrinsic dynamics)
Thank you.
Alexandr Notchenko Ermek Kapushev Evgeny Burnaev
Citations and Links
1. Deep Learning, NIPS 2015 Tutorial by Geoff Hinton, Yoshua Bengio & Yann LeCun.
2. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. "3D ShapeNets: A deep representation for volumetric shapes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912-1920.
3. Nash, C., & Williams, C. "Generative models of part-structured 3D objects."
4. Qin, Fei-wei, et al. "A deep learning approach to the classification of 3D CAD models." Journal of Zhejiang University SCIENCE C 15.2 (2014): 91-106.
5. Engel, Jakob, Thomas Schöps, and Daniel Cremers. "LSD-SLAM: Large-scale direct monocular SLAM." European Conference on Computer Vision. Springer International Publishing, 2014.
6. Su, Hang, et al. "Multi-view convolutional neural networks for 3D shape recognition." Proceedings of the IEEE International Conference on Computer Vision, 2015.
7. Whelan, Thomas, et al. "ElasticFusion: Dense SLAM without a pose graph." Robotics: Science and Systems, Vol. 11, 2015.
8. Notchenko, Alexandr, Ermek Kapushev, and Evgeny Burnaev. "Sparse 3D convolutional neural networks for large-scale shape retrieval." arXiv preprint arXiv:1611.09159 (2016).
9. Newcombe, Richard A., Dieter Fox, and Steven M. Seitz. "DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
10. Gupta, Saurabh, et al. "Learning rich features from RGB-D images for object detection and segmentation." European Conference on Computer Vision. Springer International Publishing, 2014.