Modeling Traffic Scenes for Intelligent Vehicles using CNN ...

Modeling Traffic Scenes for Intelligent Vehicles using

CNN-based Detection and Orientation Estimation

Carlos Guindel, David Martín and José María Armingol

Intelligent Systems Laboratory (LSI) · Universidad Carlos III de Madrid

Sevilla · 23 November 2017

Agenda

Introduction

Obstacle detection

Scene modeling

Results

Conclusion

2

Modeling Traffic Scenes for Intelligent Vehicles using CNN-based Detection and Orientation Estimation

C. Guindel et al. · ROBOT 2017

Agenda

Introduction

Obstacle detection

Scene modeling

Results

Conclusion

3



Introduction4



• A basic requirement for driving tasks

Obstacle detection

• An accurate estimation of the class is essential

Classification• Close-to-market

assemblies

• Rich data source

Vision-based approaches

• Feature learning

• The new paradigm in computer vision

Convolutional Neural Networks

Automated vehicles

• Highly dynamic, semi-structured environments

• They have to handle complex situations

IVVI 2.0 project5



INTELLIGENT VEHICLE BASED ON VISUAL INFORMATION 2.0

Trinocular stereo cam.

Multi-layer lidar scanner

Side-looking cameras

Computer with GPU

+info:uc3m.es/islab

System overview6



• Two main branches intended to run in parallel

• Obstacle detection

• Features are extracted exclusively from the left stereo image

• Scene modeling

• Stereo-based 3D reconstruction & flat-ground assumption

Agenda

Introduction

Obstacle detection

Scene modeling

Results

Conclusion

7



Faster R-CNN framework



S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2016.

Convolutional features computed only once per image

A RPN generates proposals wrt. a fixed set of anchors

Conv. features in these regionsare pooled for classification

Parameters are learned through a multi-task loss

8

Viewpoint estimation

• Faster R-CNN framework was modified to introduce viewpoint inference

9



C. Guindel, D. Martin, and J. M. Armingol, “Joint object detection and viewpoint estimation using CNN features,” in Proc. of the IEEE International Conference on Vehicular Electronics and Safety (ICVES), 2017, pp. 145–150.

Final estimation: Θ𝑖∗ → መ𝜃

Discrete viewpoint inference



𝑁𝑏 angle bins Θ𝑖 …Θ𝑁𝑏

𝑁𝑏 = 8Training: 𝜃𝑖0→ Θ𝑖

Inference output: r ∈ Δ𝑁𝑏−1

𝑟

Elements of 𝑟

10

• Every object is assigned a bin

• Inference gives a categorial distribution

Joint detection and viewpoint estimation



11

Feature map

Proposal

Fixed sizefeat. vector

Fully connected (FC) layers

FC layer FC layer FC layer

B. Box regression

Softmax Softmax Softmax

Class Viewpoint

Only 𝑁𝑏 · 𝐾 · 4096new weights

…

𝑁𝑏

· 𝐾

𝑁𝑏 · 𝐾

elementsangle bins

classes

Fast

er R

-CN

N lo

ss

Loss function and training

• Unweighted muli-task loss with five components

Joint Object Detection and Viewpoint Estimation using CNN featuresCarlos Guindel · ICVES 2017

Logistic lossfor RPN objectness

Smooth-L1 lossfor RPN b.box regression

Logistic lossfor class

Smooth-L1 lossfor b.box regression

Logistic lossfor viewpoint

estimation

12

…

𝑁𝑏 angle bins of the ground truth class

Ground-truth angle bin

Normalized on the batch

Agenda

Introduction

Obstacle detection

Scene modeling

Results

Conclusion

13



Scene modeling14



Left image

Right image

DisparityP.C.

generatorDisparity

map

3D point cloud

3D RECONSTRUCTION

Scene modeling15



Left image

Right image

DisparityP.C.

generatorDisparity

map

3D point cloud

3D RECONSTRUCTION

Scene modeling16



Left image

Right image

DisparityP.C.

generatorDisparity

map

3D point cloud

3D RECONSTRUCTION

SGM stereo matchingSuitable for environments with lack of texture, illumination changes, etc.

Scene modeling17



Left image

Right image

DisparityP.C.

generatorDisparity

map

3D point cloud

3D RECONSTRUCTION

Pin-hole + disparityWe build a XYZRGB cloud

from the left image and the disparity map

Scene modeling18



Left image

Right image

DisparityP.C.

generatorDisparity

map

3D point cloud

3D point cloud

Plane

segmentation

Plane model

Calibration from plane

Camera-to-world

calibration

3D RECONSTRUCTION

EXTRINSIC PARAMETERS AUTO-CALIBRATION

Scene modeling19



3D point cloud

Plane

segmentation

Plane model


Camera-to-world

calibration


Voxel grid dowsamplingThe cloud from the 3D reconstruction pipeline is downsampled (grid size: 20 cm)

Scene Modeling20



3D point cloud

Plane

segmentation

Plane model


Camera-to-world

calibration


Planar segmentationUsing RANSAC with a 10 cm threshold, and a small angular tolerance.

…Pass through filtersVertical axis: 0-2 mDepth axis: 0-20 m

Scene modeling21



3D point cloud

Plane

segmentation

Plane model


Camera-to-world

calibration


roll

pitchPlane:

Object localization22



Object localization23



Object ROI for localization

11 central rows of the object’s bounding box

Filtered cloudWithout points belonging to the ground or too close to the camera

Organized pointcloud

Coordinates on the world’sXY plane for every point

𝒙 = (𝑥, 𝑦, 𝑧)

𝒙𝑜𝑏𝑗 = 𝑚𝑒𝑑𝑖𝑎𝑛 𝒙

𝜃𝑜𝑏𝑗 = 𝛼 − atan2(𝑦𝑜𝑏𝑗 − 𝑥𝑜𝑏𝑗)

x

𝑦𝑜𝑏𝑗

world

y

𝑥𝑜𝑏𝑗

𝜃

Top-down view

yaw

Agenda

Introduction

Obstacle detection

Scene modeling

Results

Conclusion

24



Results: Detection and viewpoint estimation

• KITTI Object Detection Benchmark

• 5,576 images for training and 2,065 for validation

• Labels for class and orientation available

• Evaluation metric

• Average Orientation Similarity (AOS)

25



Results: Detection and viewpoint estimation

• Two different architectures:

• ZF (lightweight) and VGG 16-layer (more complex)

• Three different scales (height in pixels):

• 375, 500, 625

26



88,43 66,28 63,41 N.A. N.A. N.A. 2 sec.Top-performing

comparable method in the KITTI ranking

(ms)

Results: Scene modeling27



Agenda

Introduction

Obstacle detection

Scene modeling

Results

Conclusion

30



Conclusion

• Towards a full object-based scene understanding

• CNN-based detection and viewpoint inference

• Efficient approach: the same set of features is used for all tasks

• Stereo-vision 3D information is included for situation assessment

• Results validate our approach

31



Code for CNN detection & viewpoints available athttps://github.com/cguindel/lsi-faster-rcnn

Future work

• New categories of traffic elements

• Extension to the time domain

• Tracking, filtering, etc.

• Including information from other perception modules

• E.g., semantic segmentation

THANKS FOR

YOUR ATTENTION

23 November 2017

ROBOT'2017 - Third Iberian Robotics Conference

Carlos Guindel · cguindel.github.io · [email protected]

Modeling Traffic Scenes for Intelligent Vehicles using CNN ...

Documents

Transcript of Modeling Traffic Scenes for Intelligent Vehicles using CNN ...