Burnaev and Notchenko. Skoltech. Bridging gap between 2D and 3D with Deep Learning
Bridging the gap between 2D and 3D with Deep Learning
Evgeny Burnaev (PhD) <[email protected]>, assoc. prof., Skoltech
Alexandr Notchenko <[email protected]>, PhD student
[1]
ImageNet top-5 error over the years (figure: deep-learning-based methods vs. feature-based methods vs. human performance)
Supervised Deep Learning: data type and supervision
● 2D image classification, detection, segmentation: class label, object detection box, segmentation contours
● Pose estimation: structure of a "skeleton" on the image
But the world is in 3D
3D deep learning is gaining popularity
Workshops:
● Deep Learning for Robotic Vision Workshop, CVPR 2017
● Geometry Meets Deep Learning, ECCV 2016
● 3D Deep Learning Workshop @ NIPS 2016
● Large Scale 3D Data: Acquisition, Modelling and Analysis, CVPR 2016
● 3D from a Single Image, CVPR 2015
A Google Scholar search for "3D" "Deep Learning" returns:
Year   # articles
2012   410
2013   627
2014   1210
2015   2570
2016   5440
Representations of 3D data for deep learning
● Many 2D projections. Pros: preserves surface texture; many 2D DL methods exist. Cons: redundant representation; vulnerable to optical illusions.
● Voxels. Pros: simple; can be sparse; has volumetric properties. Cons: loses surface properties.
● Point cloud. Pros: can be sparse. Cons: loses both surface and volumetric properties.
● 2.5D images. Pros: cheap measurement devices; senses depth. Cons: self-occlusion of bodies in a scene; a lot of noise in measurements.
[6]
[2]
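As a sketch of the voxel representation discussed above, a point cloud can be binned into a binary occupancy grid. The function name and resolution below are illustrative, not the authors' code; real pipelines typically use library voxelizers.

```python
import numpy as np

def voxelize(points, resolution=30):
    """Convert an (N, 3) point cloud to a binary occupancy grid.

    Illustrative helper: bins each point into one of
    resolution^3 cells after rescaling the cloud's bounding box.
    """
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    # Scale coordinates into [0, resolution - 1] along each axis.
    scale = (resolution - 1) / np.maximum(hi - lo, 1e-9)
    idx = np.floor((points - lo) * scale).astype(int)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# A cloud of 1000 random points occupies at most 1000 of 27,000 cells,
# which is why sparse storage pays off at higher resolutions.
cloud = np.random.rand(1000, 3)
grid = voxelize(cloud, 30)
```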
3D shape as a dense point cloud
Latest developments in the SLAM family of methods
LSD-SLAM (Large-Scale Direct Monocular Simultaneous Localization and Mapping)
[5] LSD-SLAM: direct (feature-less) monocular SLAM
ElasticFusion
ElasticFusion: dense SLAM without a pose graph [7]
DynamicFusion
The technique won the prestigious CVPR 2015 best paper award. [9]
Problems of SLAM algorithms
● Don't represent objects (only know surfaces)
● Mostly dense representations (require a lot of data)
● The whole scene is one big surface, e.g. different objects that are close to each other cannot be separated.
3D Shape Retrieval
3D Design Phase
● There exist massive repositories of 3D CAD models, e.g. GrabCAD (chairs, mechanical parts)
● Designers spend about 60% of their time searching for the right information
● Massive and complex CAD models are usually disorderly archived in enterprises, which makes design reuse a difficult task
● 3D model retrieval can significantly shorten product lifecycles
3D Shape-based Model Retrieval
● 3D models are complex, so there are no clear search rules
● Text-based search has its limitations: e.g. 3D models are often poorly annotated
● There is some commercial software for 3D CAD model search, e.g. Exalead OnePart by Dassault Systèmes, Geolus Search by Siemens PLM, and others
● However, the methods used are time-consuming, are often based on hand-crafted descriptors, can be limited to a specific class of shapes, and are not robust to scaling, rotations, etc.
Sparse 3D Convolutional Neural Networks for Large-Scale Shape Retrieval
Alexandr Notchenko, Ermek Kapushev, Evgeny Burnaev
Presented at 3D Deep Learning Workshop at NIPS 2016
Sparsity of the voxel representation
● A 30^3 voxel grid is already enough to understand a simple shape
● With texture information it would be even easier
● Averaged over all classes of the ModelNet40 training set, occupancy at voxel resolution 40 is only 5.5%
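The occupancy figure above can be illustrated with a toy computation. The shape below (a thin spherical shell as a crude stand-in for an object surface) and the resulting number are synthetic, not the ModelNet40 statistic:

```python
import numpy as np

# Sparsity = fraction of occupied cells in a binary voxel grid.
grid = np.zeros((40, 40, 40), dtype=bool)

# Occupy a thin spherical shell: surfaces touch few volume cells,
# which is what makes voxel grids of real shapes so sparse.
x, y, z = np.indices(grid.shape)
r = np.sqrt((x - 20) ** 2 + (y - 20) ** 2 + (z - 20) ** 2)
grid[(r > 14) & (r < 16)] = True

sparsity = grid.sum() / grid.size  # small fraction of 64,000 cells
```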
Shape Retrieval
A Sparse 3D CNN maps each shape to a feature vector (e.g. V_plane for a plane). Feature vectors for the whole dataset (V_car, V_person, ...) are precomputed; given a query, items are retrieved by cosine distance between the query's feature vector and the precomputed ones.
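The retrieval step can be sketched as follows. The embeddings here are random placeholders standing in for precomputed CNN features; the function name and dimensions are illustrative:

```python
import numpy as np

def cosine_retrieve(query_vec, db_vecs, k=3):
    """Rank database feature vectors by cosine similarity to the query.

    db_vecs: (N, D) array of precomputed shape embeddings
    (the V_car, V_person, ... of the pipeline).
    Returns the indices of the top-k most similar items.
    """
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                 # cosine similarity per item
    return np.argsort(-sims)[:k]  # indices of top-k matches

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 64))                # 100 fake embeddings
query = db[17] + 0.01 * rng.normal(size=64)    # near-duplicate of item 17
top = cosine_retrieve(query, db)
# top[0] == 17: the near-duplicate ranks first
```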
Triplet loss
The representation can be efficiently learned by minimizing triplet loss.
A triplet is a set (a, p, n), where
● a is the anchor object
● p is a positive object, similar to the anchor
● n is a negative object, not similar to the anchor
The loss is L(a, p, n) = max(0, d(a, p) - d(a, n) + alpha), where alpha is a margin parameter and d(a, p), d(a, n) are the distances between p and a and between n and a.
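A minimal sketch of this loss, assuming Euclidean distance and a margin of 0.2 (the paper's exact distance and margin may differ):

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    """L(a, p, n) = max(0, d(a, p) - d(a, n) + margin), Euclidean d."""
    d_ap = np.linalg.norm(a - p)  # anchor-to-positive distance
    d_an = np.linalg.norm(a - n)  # anchor-to-negative distance
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([1.0, 0.0])   # far from the anchor
loss = triplet_loss(a, p, n)  # 0.1 - 1.0 + 0.2 < 0, so loss is 0
```

When the negative is already farther than the positive by more than the margin, the loss is zero and the triplet contributes no gradient; swapping p and n makes the loss positive and pushes the embedding apart.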
Our approach
● Use very large resolutions and sparse representations.
● Use triplet learning for 3D shapes.
● Use the large-scale shape datasets ModelNet and ShapeNet.
Represent a voxel shape as a vector
Obligatory t-SNE visualization of the learned shape embeddings (figure)
Conclusions
● For small shape datasets, voxels (as sparse 3D tensors) can work.
● Voxels don't scale to hundreds of classes and lose texture information.
● They cannot encode complicated object domains.
Problems for next 5 years
Autonomous Vehicles
Augmented (Mixed) Reality
Robotics in human environments
Robotic Control in Human Environments
Commodity sensors to create 2.5D images
Intel RealSense Series
Asus Xtion Pro
Microsoft Kinect v2
Structure Sensor
What do they have in common?
They require understanding the whole scene.
Problem of “Holistic” Scene understanding
Lin, D., Fidler, S., & Urtasun, R. "Holistic scene understanding for 3D object detection with RGBD cameras." Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1417-1424.
● Human environments are often designed by humans
● Most of the objects are created by humans
● Context provides information via joint probability functions
● Textures are caused by materials and can therefore explain the function and structure of an object
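The "context provides information" point can be illustrated with a toy Bayesian update: an ambiguous detector score combined with a scene-context prior. All numbers below are invented for illustration:

```python
# P(object | score, scene) is proportional to
# P(score | object) * P(object | scene)  (Bayes' rule over two hypotheses).
prior_kitchen = {"mug": 0.30, "bottle": 0.10}  # scene-context prior
likelihood = {"mug": 0.5, "bottle": 0.6}       # ambiguous detector scores

def posterior(priors, likes):
    """Normalize prior * likelihood over the competing hypotheses."""
    unnorm = {k: priors[k] * likes[k] for k in priors}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

post = posterior(prior_kitchen, likelihood)
# Context tips the decision toward "mug" despite the weaker likelihood.
```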
Connecting 3 families of CV algorithms is inevitable
Learnable Computer Vision Systems(Deep Learning)
Geometric Computer Vision (SLAMs)
Probabilistic Computer Vision
(Bayesian methods)
At the intersection of the three: Probabilistic Inverse Graphics
Probabilistic Inverse Graphics enables:
● Taking setting information into account (shop: shelves and products; street: buildings, cars, pedestrians)
● Making maximum likelihood estimates from data and a model (or giving directions on how best to reduce uncertainty)
● Learning the structure of objects (materials and textures / 3D shape / intrinsic dynamics)
Thank you.
Alexandr Notchenko Ermek Kapushev Evgeny Burnaev
Citations and Links
1. Deep Learning, NIPS 2015 Tutorial by Geoff Hinton, Yoshua Bengio & Yann LeCun.
2. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. "3D ShapeNets: A deep representation for volumetric shapes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912-1920.
3. Nash, C., & Williams, C. "Generative models of part-structured 3D objects."
4. Qin, Fei-wei, et al. "A deep learning approach to the classification of 3D CAD models." Journal of Zhejiang University SCIENCE C 15.2 (2014): 91-106.
5. Engel, Jakob, Thomas Schöps, and Daniel Cremers. "LSD-SLAM: Large-scale direct monocular SLAM." European Conference on Computer Vision. Springer International Publishing, 2014.
6. Su, Hang, et al. "Multi-view convolutional neural networks for 3D shape recognition." Proceedings of the IEEE International Conference on Computer Vision, 2015.
7. Whelan, Thomas, et al. "ElasticFusion: Dense SLAM without a pose graph." Robotics: Science and Systems, Vol. 11, 2015.
8. Notchenko, Alexandr, Ermek Kapushev, and Evgeny Burnaev. "Sparse 3D convolutional neural networks for large-scale shape retrieval." arXiv preprint arXiv:1611.09159 (2016).
9. Newcombe, Richard A., Dieter Fox, and Steven M. Seitz. "DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
10. Gupta, Saurabh, et al. "Learning rich features from RGB-D images for object detection and segmentation." European Conference on Computer Vision. Springer International Publishing, 2014.