Stereo R-CNN based 3D Object Detection for Autonomous Driving

  • Peiliang Li, Xiaozhi Chen, and Shaojie Shen (HKUST, DJI). Stereo R-CNN based 3D Object Detection for Autonomous Driving. arXiv:1902.09738v2, 2019

    ECE 285 - AUTONOMOUS DRIVING SYSTEMS

    Presented by Savitha Srinivasan

  • MOTIVATION

    • Why 3D object detection?

    • Existing methods:

    • LIDAR-based: Accurate, but expensive

    • Monocular camera-based: Does not guarantee accuracy

    • What we are looking for: a low-cost alternative that does not compromise on accuracy

  • STEREO

    • More precise depth information than monocular image-based methods

    • Drawback of existing stereo-based methods: they don't take advantage of the dense object constraints in stereo images

    [2] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals using stereo imagery for accurate object class detection. In TPAMI, 2017.

    [4] B. Xu and Z. Chen. Multi-level fusion based 3D object detection from monocular images. In IEEE CVPR, 2018.

  • OBJECTIVE

    ● To use the geometric and semantic information available in stereo imagery for 3D object detection.

    ● How?

    ● Extend Faster R-CNN to detect and associate objects in stereo images

    ● Estimate a 3D bounding box using keypoints and geometric constraints in the stereo pair

    ● Refine the estimation using a dense region-based photometric alignment method.

  • FASTER R-CNN:

    ● Introduces Region Proposal Networks (RPN)

    ● Uses a predefined set of nine anchor boxes at each sliding-window position

    ● Regression gives offsets from anchor boxes to the proposed RoIs (decoded as sketched below)

    ● Classification gives the probability that each proposed RoI shows an object.

    CONCEPTUAL REVIEW
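
    The anchor-to-proposal decoding above follows the standard Faster R-CNN parameterization. Below is a minimal Python sketch of that decoding; the variable names are illustrative, not taken from the authors' code, and the scale values are the common Faster R-CNN defaults.

        import numpy as np

        def decode_anchor(anchor, offsets):
            """Decode one RPN regression output against one anchor box.

            anchor:  (xa, ya, wa, ha) - anchor center and size
            offsets: (tx, ty, tw, th) - regressed offsets
            returns: (x, y, w, h)     - proposed RoI center and size
            """
            xa, ya, wa, ha = anchor
            tx, ty, tw, th = offsets
            x = xa + tx * wa        # shift the center by a fraction of the anchor size
            y = ya + ty * ha
            w = wa * np.exp(tw)     # scale width and height log-linearly
            h = ha * np.exp(th)
            return x, y, w, h

        # the nine anchors per location: 3 scales x 3 aspect ratios
        scales = [128, 256, 512]
        ratios = [0.5, 1.0, 2.0]
        anchors = [(0.0, 0.0, s * np.sqrt(r), s / np.sqrt(r))
                   for s in scales for r in ratios]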

  • FEATURE PYRAMID NETWORKS FOR RPN

    ● FPN takes a single-scale image as input and outputs proportionally sized feature maps at multiple levels

    ● The single-scale feature map in RPN can be replaced with FPN

    ● Instead of using anchors of multiple scales on a single feature map, use a single-scale anchor on each level of the pyramid (see the sketch below)

    CONCEPTUAL REVIEW
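
    A minimal sketch of the one-scale-per-level idea; the level names and the scale-to-level mapping below follow the common FPN convention and are my assumption, not values from the slides.

        # one anchor scale per pyramid level instead of all scales on one map
        anchor_scale_per_level = {
            "P2": 32,   # finest level  -> smallest anchors
            "P3": 64,
            "P4": 128,
            "P5": 256,
            "P6": 512,  # coarsest level -> largest anchors
        }

        def anchors_for_level(level, ratios=(0.5, 1.0, 2.0)):
            """Anchor (w, h) pairs for one pyramid level: one scale, three ratios."""
            s = anchor_scale_per_level[level]
            return [(s * r ** 0.5, s / r ** 0.5) for r in ratios]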

  • METHODOLOGY

    ● The network architecture can be divided into 4 main parts:

    ○ Stereo RPN module

    ○ Stereo R-CNN (stereo regression and keypoint prediction branches)

    ○ 3D box estimation

    ○ Dense alignment

  • NETWORK ARCHITECTURE OF STEREO R-CNN

  • STEREO RPN

    ● FPN is adapted for the stereo RPN.

    ● Input: Concatenated features from the left and right feature maps

    ● Output: Classification head - binary objectness classifier; Regression head - the six offsets [Δu, Δw, Δu′, Δw′, Δv, Δh]

    ● The stereo images are rectified, so Δv and Δh are shared between the left and right boxes (see the sketch below)
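
    A sketch of how the six regression targets could be computed for one anchor and one left/right ground-truth box pair. The encoding mirrors the Faster R-CNN parameterization and is my illustrative assumption; the key point is that rectification lets the vertical terms be shared.

        import math

        def stereo_rpn_targets(anchor, left_gt, right_gt):
            """Six targets [du, dw, du', dw', dv, dh] for one positive anchor.

            All boxes are (x_center, y_center, w, h). Because the stereo pair
            is rectified, the right ground-truth box shares y and h with the
            left one, so a single dv and dh are regressed for both.
            """
            xa, ya, wa, ha = anchor
            xl, yl, wl, hl = left_gt
            xr, _, wr, _ = right_gt
            du  = (xl - xa) / wa      # horizontal offset, left box
            dw  = math.log(wl / wa)   # width offset, left box
            dup = (xr - xa) / wa      # horizontal offset, right box
            dwp = math.log(wr / wa)   # width offset, right box
            dv  = (yl - ya) / ha      # shared vertical offset
            dh  = math.log(hl / ha)   # shared height offset
            return [du, dw, dup, dwp, dv, dh]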

  • STEREO R-CNN

    2 modules:

    ● Stereo regression - Takes concatenated features from left and right ROIs as input.

    ● Keypoint prediction - Predicts object keypoints from the left RoI features only

    STEREO REGRESSION

    ● Four sub-branches are used to predict the stereo bounding boxes, object class, dimension, and viewpoint angle.

  • STEREO REGRESSION

    ● Viewpoint angle ⍺ = 𝜃 + β

    ● 𝜃 - vehicle orientation with respect to the camera frame

    ● β - object azimuth with respect to the camera center

    ● Regression term for dimension prediction: the offset between the ground-truth dimension and a pre-set dimension prior (decoding sketched below)
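
    A small sketch of how the regressed quantities could be decoded. The azimuth formula β = arctan(x / z) and the log-space dimension decoding are assumptions about the convention; the slides only state ⍺ = 𝜃 + β and the offset-from-prior regression.

        import math

        def decode_orientation(alpha, x, z):
            """Recover vehicle orientation theta from the regressed viewpoint
            angle alpha once the 3D position (x, z) is known:
            alpha = theta + beta, with beta = arctan(x / z) the azimuth of the
            object center w.r.t. the camera (sign convention assumed)."""
            return alpha - math.atan2(x, z)

        def decode_dimension(offset, prior):
            """Recover a box dimension from the regressed offset and a pre-set
            class prior, e.g. a mean car size (log-space decoding assumed)."""
            return prior * math.exp(offset)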

  • KEYPOINT PREDICTION

    ● Semantic keypoints: the four corners of the bottom face of the 3D bounding box

    ● Perspective keypoint: the one semantic keypoint that projects to the middle of the 2D box (rather than onto its left or right edge)

    ● Boundary keypoints: delimit the image region that belongs to the current object

    ● Output of the keypoint branch: 6 channels

    ● First four: probability of each semantic keypoint being the perspective keypoint

    ● Last two: boundary keypoint probabilities (read out as sketched below)
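
    A sketch of reading out the keypoint branch, assuming the six channels are predicted per u-column of the left RoI (a 6 x W map) with the height dimension already aggregated; the channel layout follows the slide.

        import numpy as np

        def decode_keypoints(prob):
            """prob: (6, W) keypoint probabilities over u-columns of the left RoI.

            Rows 0-3: probability of each of the four semantic keypoints being
            the perspective keypoint at each u position.
            Rows 4-5: left / right boundary keypoint probabilities.
            """
            persp = prob[:4]
            corner, u_p = np.unravel_index(persp.argmax(), persp.shape)
            u_left = int(prob[4].argmax())    # left boundary keypoint
            u_right = int(prob[5].argmax())   # right boundary keypoint
            return {"perspective_corner": int(corner), "u_p": int(u_p),
                    "boundary_u": (u_left, u_right)}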

  • 3-D BOX ESTIMATION

    ● State of the 3D bounding box: x = {x, y, z, θ}

    ● Seven measurements from the 2D bounding boxes and the perspective keypoint: z = {u_l, u_r, v_t, v_b, u_l′, u_r′, u_p}

  • ● 3D-2D relations are formulated as projection transformations: each of the seven measurements is the camera projection of a corresponding corner of the 3D box, given the state x and the regressed dimensions

    ● Solved by minimizing the reprojection error with the Gauss-Newton method (a numerical sketch follows)
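
    A numerical sketch of the estimation step under stated assumptions: normalized image coordinates (u = X/Z, v = Y/Z), yaw about the camera y-axis, the projected box extremes as the seven predicted measurements, and a numerically differentiated Gauss-Newton loop. The paper writes the 3D-2D relations in closed form per viewpoint; this sketch only illustrates the same least-squares idea.

        import numpy as np

        def project_box(state, dims, b=0.0):
            """Project the 8 corners of a 3D box into normalized image coords.
            state: (x, y, z, theta); dims: (w, h, l);
            b: horizontal camera offset (0 for left, baseline for right)."""
            x, y, z, th = state
            w, h, l = dims
            dx = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * w / 2
            dy = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
            dz = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * l / 2
            X = x + dx * np.cos(th) + dz * np.sin(th) - b   # yaw about the y axis
            Y = y + dy
            Z = z - dx * np.sin(th) + dz * np.cos(th)
            return X / Z, Y / Z

        def residuals(state, meas, dims, baseline, persp_corner):
            """Seven residuals for z = {u_l, u_r, v_t, v_b, u_l', u_r', u_p}."""
            u, v = project_box(state, dims)
            ur, _ = project_box(state, dims, b=baseline)
            pred = np.array([u.min(), u.max(), v.min(), v.max(),
                             ur.min(), ur.max(), u[persp_corner]])
            return pred - meas

        def gauss_newton(state, meas, dims, baseline, persp_corner, iters=10):
            """Minimize the reprojection error with a numerical Jacobian."""
            state = np.asarray(state, dtype=float)
            for _ in range(iters):
                r = residuals(state, meas, dims, baseline, persp_corner)
                J = np.zeros((7, 4))
                for j in range(4):
                    step = np.zeros(4)
                    step[j] = 1e-6
                    J[:, j] = (residuals(state + step, meas, dims, baseline,
                                         persp_corner) - r) / 1e-6
                state -= np.linalg.solve(J.T @ J + 1e-9 * np.eye(4), J.T @ r)
            return state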

  • 3-D BOX ALIGNMENT

    ● A dense region-based photometric alignment method is used to refine the depth of the 3D box

    ● Valid RoI: the region between the left and right boundary keypoints

    ● Photometric error (in normalized image coordinates): e_i = || I_l(u_i, v_i) − I_r(u_i − b / (z + Δz_i), v_i) ||

    ● I_l, I_r - intensities of the left and right images

    ● b - baseline length

    ● Δz_i = z_i − z, the depth of pixel i relative to the box-center depth z

    ● Total matching cost = SSD of the photometric error over all pixels in the valid RoI

    ● The depth z is solved by minimizing the total matching cost (sketched below)
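
    A sketch of the alignment as a brute-force depth scan, assuming pixel coordinates, a precomputed relative depth Δz_i for every valid-RoI pixel (e.g. from the estimated box shape), and pixel disparity f·b/z, which matches the normalized form b/z above.

        import numpy as np

        def align_depth(I_l, I_r, pixels, dz, f, b, z0, search=2.0, steps=100):
            """Refine the box-center depth by dense photometric alignment.

            I_l, I_r: grayscale left/right images as float arrays (H, W)
            pixels:   (N, 2) integer (u, v) coordinates inside the valid RoI
            dz:       (N,) depth of each pixel relative to the box center
            f, b:     focal length in pixels and stereo baseline
            z0:       initial center depth from the 3D box estimation
            Scans depths in [z0 - search, z0 + search] and keeps the one with
            the minimum sum of squared photometric differences (SSD)."""
            u, v = pixels[:, 0], pixels[:, 1]
            best_z, best_cost = z0, np.inf
            for z in np.linspace(z0 - search, z0 + search, steps):
                disp = f * b / (z + dz)                    # per-pixel disparity
                ur = np.clip(np.round(u - disp).astype(int), 0, I_r.shape[1] - 1)
                e = I_l[v, u] - I_r[v, ur]                 # photometric error e_i
                cost = np.sum(e ** 2)                      # total matching cost
                if cost < best_cost:
                    best_z, best_cost = z, cost
            return best_z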

  • ANALYSIS AND RESULTS

    NETWORK

    ● Backbone: ResNet-101 with FPN

    ● Anchor boxes:

    ● Scales: {32, 64, 128, 256, 512}

    ● Aspect ratios: {0.5, 1, 2}

    DATASET

    ● Evaluated on the KITTI object detection benchmark

    ● 50% of the images for the training set and 50% for the validation set

    DATA AUGMENTATION

    ● Flipping

    ● Mirroring of keypoints

  • TRAINING

    Multitask loss:

    L = L_cls^p + L_reg^p + L_cls^r + L_reg^r + L_⍺ + L_dim + L_key

    ● cls: classification, reg: regression

    ● p: RPN, r: stereo R-CNN

    ● ⍺, dim, key: sub-branches of stereo regression

    ● The network is trained using SGD with a weight decay of 0.0005 and momentum of 0.9 (configuration sketched after this slide)

    ● Non-Maximum Suppression (NMS):

    ○ Training: 2000 candidates

    ○ Testing: 300 candidates

    ANALYSIS AND RESULTS
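
    A PyTorch-style sketch of the training setup. Only the momentum and weight decay come from the slide; the equal loss weights, the learning rate, and the stand-in model are assumptions.

        import torch

        def total_loss(losses):
            """Multitask loss: RPN and R-CNN classification/regression terms
            plus the viewpoint, dimension, and keypoint sub-branch terms.
            Equal weighting is an assumption; the deck does not give weights."""
            return (losses["cls_p"] + losses["reg_p"]     # stereo RPN
                    + losses["cls_r"] + losses["reg_r"]   # stereo R-CNN
                    + losses["alpha"] + losses["dim"] + losses["key"])

        # optimizer configuration from the slide; the learning rate is a placeholder
        model = torch.nn.Linear(4, 4)  # stand-in for the Stereo R-CNN network
        optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                    momentum=0.9, weight_decay=0.0005)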

  • QUALITATIVE RESULTS

    Top: 3D bounding box detections, Bottom: 2D detections

  • QUANTITATIVE RESULTS

    • Inference time: 0.28 seconds on a Titan Xp GPU

    • 2D performance: comparable with Faster R-CNN

    • AP and AR for stereo jointly evaluate detection and association performance

  • QUANTITATIVE RESULTS

    • 3D performance: significantly better than other stereo (3DOP [2], Multi-Fusion [4]) and monocular (Mono3D [5], Deep3DBox [6]) methods

    • Marginally better than the LIDAR-based method VeloFCN [3]

  • EXPERIMENTS

    • Left-right feature fusion: concatenation and element-wise mean were compared; concatenation reported better results

    • Photometric alignment improved the results by a large margin

  • ADVANTAGES

    ● Low-cost solution

    ● Gives more precise depth information than monocular methods

    ● Dense 3D box alignment improves accuracy

    ● Has the potential to provide longer-range perception by combining stereo modules with different focal lengths and baselines.

  • DISADVANTAGES

    • Computationally intensive: a two-stage pipeline with a heavy ResNet-101 FPN backbone

    • Does not work well for far-away objects

    • Inference time is large compared to single-stage alternatives

  • TAKE-AWAY

    ● Learning-aided geometric approach

    ● Takes advantage of both the semantic properties and the dense constraints of objects

    ● Ensures more accurate localization

  • REFERENCES

    1. S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

    2. X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals using stereo imagery for accurate object class detection. In TPAMI, 2017.

    3. B. Li, T. Zhang, and T. Xia. Vehicle detection from 3D lidar using fully convolutional network. In Robotics: Science and Systems, 2016.

    4. B. Xu and Z. Chen. Multi-level fusion based 3D object detection from monocular images. In IEEE CVPR, 2018.

    5. X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3D object detection for autonomous driving. In IEEE CVPR, pages 2147–2156, 2016.

    6. A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká. 3D bounding box estimation using deep learning and geometry. In IEEE CVPR, pages 5632–5640, 2017.