Stereo R-CNN based 3D Object Detection for Autonomous Driving

  • Peiliang Li, Xiaozhi Chen, and Shaojie Shen (HKUST, DJI). Stereo R-CNN based 3D Object Detection for Autonomous Driving. arXiv:1902.09738v2, 2019

    ECE 285 - AUTONOMOUS DRIVING SYSTEMS

    Presented by Savitha Srinivasan

  • MOTIVATION

    • Why 3D object detection?

    • Existing methods:

    • LIDAR-based: Accurate, but expensive

    • Monocular camera-based: Does not guarantee accuracy

    • What we are looking for: a low-cost alternative that does not compromise on accuracy

  • STEREO

    • More precise depth information than monocular image-based methods

    • Drawback of existing stereo-based methods: they don't take advantage of the dense object constraints in stereo images

    [2] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals using stereo imagery for accurate object class detection. In TPAMI, 2017.

    [4] B. Xu and Z. Chen. Multi-level fusion based 3D object detection from monocular images. In IEEE CVPR, 2018.

  • OBJECTIVE

    ● To use the geometric and semantic information available in stereo imagery for 3D object detection.

    ● How?

    ● Extend Faster R-CNN to detect and associate objects in stereo images

    ● Estimate a 3D bounding box using keypoints and geometric constraints in the stereo pair

    ● Refine the estimation using a dense region-based photometric alignment method.

  • FASTER R-CNN:

    ● Introduces Region Proposal Networks (RPN)

    ● Uses a predefined set of nine anchor boxes at each sliding-window position

    ● Regression gives offsets from anchor boxes to the proposed RoIs (decoded as sketched below)

    ● Classification gives the probability that each proposed RoI shows an object.

    CONCEPTUAL REVIEW
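
    The anchor-to-proposal decoding above follows the standard Faster R-CNN parameterization. Below is a minimal Python sketch of that decoding; the variable names are illustrative, not taken from the authors' code, and the scale values are the common Faster R-CNN defaults.

        import numpy as np

        def decode_anchor(anchor, offsets):
            """Decode one RPN regression output against one anchor box.

            anchor:  (xa, ya, wa, ha) - anchor center and size
            offsets: (tx, ty, tw, th) - regressed offsets
            returns: (x, y, w, h)     - proposed RoI center and size
            """
            xa, ya, wa, ha = anchor
            tx, ty, tw, th = offsets
            x = xa + tx * wa        # shift the center by a fraction of the anchor size
            y = ya + ty * ha
            w = wa * np.exp(tw)     # scale width and height log-linearly
            h = ha * np.exp(th)
            return x, y, w, h

        # the nine anchors per location: 3 scales x 3 aspect ratios
        scales = [128, 256, 512]
        ratios = [0.5, 1.0, 2.0]
        anchors = [(0.0, 0.0, s * np.sqrt(r), s / np.sqrt(r))
                   for s in scales for r in ratios]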

  • FEATURE PYRAMID NETWORKS FOR RPN

    ● FPN takes a single-scale image as input and outputs proportionally sized feature maps at multiple levels

    ● The single-scale feature map in RPN can be replaced with FPN

    ● Instead of using anchors of multiple scales on a single feature map, use a single-scale anchor on each level of the pyramid (see the sketch below)

    CONCEPTUAL REVIEW
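
    A minimal sketch of the one-scale-per-level idea; the level names and the scale-to-level mapping below follow the common FPN convention and are my assumption, not values from the slides.

        # one anchor scale per pyramid level instead of all scales on one map
        anchor_scale_per_level = {
            "P2": 32,   # finest level  -> smallest anchors
            "P3": 64,
            "P4": 128,
            "P5": 256,
            "P6": 512,  # coarsest level -> largest anchors
        }

        def anchors_for_level(level, ratios=(0.5, 1.0, 2.0)):
            """Anchor (w, h) pairs for one pyramid level: one scale, three ratios."""
            s = anchor_scale_per_level[level]
            return [(s * r ** 0.5, s / r ** 0.5) for r in ratios]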

  • METHODOLOGY

    ● The network architecture can be divided into 4 main parts:

    ○ Stereo RPN module

    ○ Stereo R-CNN (stereo regression and keypoint prediction branches)

    ○ 3D box estimation

    ○ Dense alignment

  • NETWORK ARCHITECTURE OF STEREO R-CNN

  • STEREO RPN

    ● FPN is adapted for the stereo RPN.

    ● Input: Concatenated features from the left and right feature maps

    ● Output: Classification head - binary objectness classifier; Regression head - the six offsets [Δu, Δw, Δu′, Δw′, Δv, Δh]

    ● The stereo images are rectified, so Δv and Δh are shared between the left and right boxes (see the sketch below)
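
    A sketch of how the six regression targets could be computed for one anchor and one left/right ground-truth box pair. The encoding mirrors the Faster R-CNN parameterization and is my illustrative assumption; the key point is that rectification lets the vertical terms be shared.

        import math

        def stereo_rpn_targets(anchor, left_gt, right_gt):
            """Six targets [du, dw, du', dw', dv, dh] for one positive anchor.

            All boxes are (x_center, y_center, w, h). Because the stereo pair
            is rectified, the right ground-truth box shares y and h with the
            left one, so a single dv and dh are regressed for both.
            """
            xa, ya, wa, ha = anchor
            xl, yl, wl, hl = left_gt
            xr, _, wr, _ = right_gt
            du  = (xl - xa) / wa      # horizontal offset, left box
            dw  = math.log(wl / wa)   # width offset, left box
            dup = (xr - xa) / wa      # horizontal offset, right box
            dwp = math.log(wr / wa)   # width offset, right box
            dv  = (yl - ya) / ha      # shared vertical offset
            dh  = math.log(hl / ha)   # shared height offset
            return [du, dw, dup, dwp, dv, dh]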

  • STEREO R-CNN

    2 modules:

    ● Stereo regression - Takes concatenated features from left and right ROIs as input.

    ● Keypoint prediction - Predicts object keypoints from the left RoI features only

    STEREO REGRESSION

    ● Four sub-branches are used to predict the stereo bounding boxes, object class, dimension, and viewpoint angle.

  • STEREO REGRESSION

    ● Viewpoint angle ⍺ = 𝜃 + β

    ● 𝜃 - vehicle orientation with respect to the camera frame

    ● β - object azimuth with respect to the camera center

    ● Regression term for dimension prediction: the offset between the ground-truth dimension and a pre-set dimension prior (decoding sketched below)
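
    A small sketch of how the regressed quantities could be decoded. The azimuth formula β = arctan(x / z) and the log-space dimension decoding are assumptions about the convention; the slides only state ⍺ = 𝜃 + β and the offset-from-prior regression.

        import math

        def decode_orientation(alpha, x, z):
            """Recover vehicle orientation theta from the regressed viewpoint
            angle alpha once the 3D position (x, z) is known:
            alpha = theta + beta, with beta = arctan(x / z) the azimuth of the
            object center w.r.t. the camera (sign convention assumed)."""
            return alpha - math.atan2(x, z)

        def decode_dimension(offset, prior):
            """Recover a box dimension from the regressed offset and a pre-set
            class prior, e.g. a mean car size (log-space decoding assumed)."""
            return prior * math.exp(offset)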

  • KEYPOINT PREDICTION

    ● Semantic keypoints: the four corners of the bottom face of the 3D bounding box

    ● Perspective keypoint: the one semantic keypoint that projects to the middle of the 2D box (rather than onto its left or right edge)

    ● Boundary keypoints: delimit the image region that belongs to the current object

    ● Output of the keypoint branch: 6 channels

    ● First four: probability of each semantic keypoint being the perspective keypoint

    ● Last two: boundary keypoint probabilities (read out as sketched below)
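
    A sketch of reading out the keypoint branch, assuming the six channels are predicted per u-column of the left RoI (a 6 x W map) with the height dimension already aggregated; the channel layout follows the slide.

        import numpy as np

        def decode_keypoints(prob):
            """prob: (6, W) keypoint probabilities over u-columns of the left RoI.

            Rows 0-3: probability of each of the four semantic keypoints being
            the perspective keypoint at each u position.
            Rows 4-5: left / right boundary keypoint probabilities.
            """
            persp = prob[:4]
            corner, u_p = np.unravel_index(persp.argmax(), persp.shape)
            u_left = int(prob[4].argmax())    # left boundary keypoint
            u_right = int(prob[5].argmax())   # right boundary keypoint
            return {"perspective_corner": int(corner), "u_p": int(u_p),
                    "boundary_u": (u_left, u_right)}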

  • 3-D BOX ESTIMATION

    ● State of the 3D bounding box: x = {x, y, z, θ}

    ● Seven measurements from the 2D bounding boxes and the perspective keypoint: z = {u_l, u_r, v_t, v_b, u_l′, u_r′, u_p}

  • ● 3D-2D relations are formulated as projection transformations: each of the seven measurements is the camera projection of a corresponding corner of the 3D box, given the state x and the regressed dimensions

    ● Solved by minimizing the reprojection error with the Gauss-Newton method (a numerical sketch follows)
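
    A numerical sketch of the estimation step under stated assumptions: normalized image coordinates (u = X/Z, v = Y/Z), yaw about the camera y-axis, the projected box extremes as the seven predicted measurements, and a numerically differentiated Gauss-Newton loop. The paper writes the 3D-2D relations in closed form per viewpoint; this sketch only illustrates the same least-squares idea.

        import numpy as np

        def project_box(state, dims, b=0.0):
            """Project the 8 corners of a 3D box into normalized image coords.
            state: (x, y, z, theta); dims: (w, h, l);
            b: horizontal camera offset (0 for left, baseline for right)."""
            x, y, z, th = state
            w, h, l = dims
            dx = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * w / 2
            dy = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
            dz = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * l / 2
            X = x + dx * np.cos(th) + dz * np.sin(th) - b   # yaw about the y axis
            Y = y + dy
            Z = z - dx * np.sin(th) + dz * np.cos(th)
            return X / Z, Y / Z

        def residuals(state, meas, dims, baseline, persp_corner):
            """Seven residuals for z = {u_l, u_r, v_t, v_b, u_l', u_r', u_p}."""
            u, v = project_box(state, dims)
            ur, _ = project_box(state, dims, b=baseline)
            pred = np.array([u.min(), u.max(), v.min(), v.max(),
                             ur.min(), ur.max(), u[persp_corner]])
            return pred - meas

        def gauss_newton(state, meas, dims, baseline, persp_corner, iters=10):
            """Minimize the reprojection error with a numerical Jacobian."""
            state = np.asarray(state, dtype=float)
            for _ in range(iters):
                r = residuals(state, meas, dims, baseline, persp_corner)
                J = np.zeros((7, 4))
                for j in range(4):
                    step = np.zeros(4)
                    step[j] = 1e-6
                    J[:, j] = (residuals(state + step, meas, dims, baseline,
                                         persp_corner) - r) / 1e-6
                state -= np.linalg.solve(J.T @ J + 1e-9 * np.eye(4), J.T @ r)
            return state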

  • 3-D BOX ALIGNMENT

    ● A dense region-based photometric alignment method is used to refine the depth of the 3D box

    ● Valid RoI: the region between the left and right boundary keypoints

    ● Photometric error (in normalized image coordinates): e_i = || I_l(u_i, v_i) − I_r(u_i − b / (z + Δz_i), v_i) ||

    ● I_l, I_r - intensities of the left and right images

    ● b - baseline length

    ● Δz_i = z_i − z, the depth of pixel i relative to the box-center depth z

    ● Total matching cost = SSD of the photometric error over all pixels in the valid RoI

    ● The depth z is solved by minimizing the total matching cost (sketched below)
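
    A sketch of the alignment as a brute-force depth scan, assuming pixel coordinates, a precomputed relative depth Δz_i for every valid-RoI pixel (e.g. from the estimated box shape), and pixel disparity f·b/z, which matches the normalized form b/z above.

        import numpy as np

        def align_depth(I_l, I_r, pixels, dz, f, b, z0, search=2.0, steps=100):
            """Refine the box-center depth by dense photometric alignment.

            I_l, I_r: grayscale left/right images as float arrays (H, W)
            pixels:   (N, 2) integer (u, v) coordinates inside the valid RoI
            dz:       (N,) depth of each pixel relative to the box center
            f, b:     focal length in pixels and stereo baseline
            z0:       initial center depth from the 3D box estimation
            Scans depths in [z0 - search, z0 + search] and keeps the one with
            the minimum sum of squared photometric differences (SSD)."""
            u, v = pixels[:, 0], pixels[:, 1]
            best_z, best_cost = z0, np.inf
            for z in np.linspace(z0 - search, z0 + search, steps):
                disp = f * b / (z + dz)                    # per-pixel disparity
                ur = np.clip(np.round(u - disp).astype(int), 0, I_r.shape[1] - 1)
                e = I_l[v, u] - I_r[v, ur]                 # photometric error e_i
                cost = np.sum(e ** 2)                      # total matching cost
                if cost < best_cost:
                    best_z, best_cost = z, cost
            return best_z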

  • ANALYSIS AND RESULTS

    NETWORK

    ● Backbone: ResNet-101 with FPN

    ● Anchor boxes:

    ● Scales: {32, 64, 128, 256, 512}

    ● Aspect ratios: {0.5, 1, 2}

    DATASET

    ● Evaluated on the KITTI object detection benchmark

    ● 50% of the images for the training set and 50% for the validation set

    DATA AUGMENTATION

    ● Flipping

    ● Mirroring of keypoints

  • TRAINING

    Multitask loss:

    L = L_cls^p + L_reg^p + L_cls^r + L_reg^r + L_⍺ + L_dim + L_key

    ● cls: classification, reg: regression

    ● p: RPN, r: stereo R-CNN

    ● ⍺, dim, key: sub-branches of stereo regression

    ● The network is trained using SGD with a weight decay of 0.0005 and momentum of 0.9 (configuration sketched after this slide)

    ● Non-Maximum Suppression (NMS):

    ○ Training: 2000 candidates

    ○ Testing: 300 candidates

    ANALYSIS AND RESULTS
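
    A PyTorch-style sketch of the training setup. Only the momentum and weight decay come from the slide; the equal loss weights, the learning rate, and the stand-in model are assumptions.

        import torch

        def total_loss(losses):
            """Multitask loss: RPN and R-CNN classification/regression terms
            plus the viewpoint, dimension, and keypoint sub-branch terms.
            Equal weighting is an assumption; the deck does not give weights."""
            return (losses["cls_p"] + losses["reg_p"]     # stereo RPN
                    + losses["cls_r"] + losses["reg_r"]   # stereo R-CNN
                    + losses["alpha"] + losses["dim"] + losses["key"])

        # optimizer configuration from the slide; the learning rate is a placeholder
        model = torch.nn.Linear(4, 4)  # stand-in for the Stereo R-CNN network
        optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                    momentum=0.9, weight_decay=0.0005)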

  • QUALITATIVE RESULTS

    Top: 3D bounding box detections, Bottom: 2D detections

  • QUANTITATIVE RESULTS

    • Inference time: 0.28 seconds on a Titan Xp GPU

    • 2D performance: comparable with Faster R-CNN

    • AP and AR for stereo jointly evaluate detection and association performance

  • QUANTITATIVE RESULTS

    • 3D performance: significantly better than other stereo (3DOP [2], Multi-Fusion [4]) and monocular (Mono3D [5], Deep3DBox [6]) methods

    • Marginally better than the LIDAR-based method VeloFCN [3]

  • EXPERIMENTS

    • Left-right feature fusion: concatenation and element-wise mean were compared; concatenation reported better results

    • Photometric alignment improved the results by a large margin

  • ADVANTAGES

    ● Low-cost solution

    ● Gives more precise depth information than monocular methods

    ● Dense 3D box alignment improves accuracy

    ● Has the potential to provide longer-range perception by combining stereo modules with different focal lengths and baselines.

  • DISADVANTAGES

    • Computationally intensive: a two-stage pipeline with a heavy ResNet-101 FPN backbone

    • Does not work well for far-away objects

    • Inference time is large compared to single-stage alternatives

  • TAKE-AWAY

    ● Learning-aided geometric approach

    ● Takes advantage of both the semantic properties and the dense constraints of objects

    ● Ensures more accurate localization

  • REFERENCES

    1. S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

    2. X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals using stereo imagery for accurate object class detection. In TPAMI, 2017.

    3. B. Li, T. Zhang, and T. Xia. Vehicle detection from 3D lidar using fully convolutional network. In Robotics: Science and Systems, 2016.

    4. B. Xu and Z. Chen. Multi-level fusion based 3D object detection from monocular images. In IEEE CVPR, 2018.

    5. X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3D object detection for autonomous driving. In IEEE CVPR, pages 2147–2156, 2016.

    6. A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká. 3D bounding box estimation using deep learning and geometry. In IEEE CVPR, pages 5632–5640, 2017.