
Perception Systems for Autonomous Vehicles using Energy-Efficient Deep Neural Networks

Forrest Iandola, Ben Landen, Kyle Bertin, Kurt Keutzer and the DeepScale Team

IMPLEMENTING AUTONOMOUS DRIVING

THE FLOW

[Diagram: SENSORS (LIDAR, ULTRASONIC, CAMERA, RADAR) + OFFLINE MAPS → REAL-TIME PERCEPTION → PATH PLANNING & ACTUATION]

What does a car need to see?

Note: the visuals on these slides are an artist's rendering created to help convey concepts. They should not be judged for accuracy.

Object Detection

[Figure: camera view with each detected object labeled by class and confidence: Vehicle (98-100%), Cyclist (99%), Pedestrian (99%)]

Distance

[Figure: the same detections annotated with estimated distance, e.g. Vehicle (99%) at 15 m, Cyclist (99%) at 16 m, Pedestrian (99%) at 7 m]

Object Tracking

[Figure: each detection additionally carries a persistent track ID and the number of frames it has been tracked, e.g. Vehicle (99%), 15 m, ID: 5 (95 frames)]

Free Space & Driveable Area

[Figure: the driveable road surface is highlighted in addition to the tracked detections]

Lane Recognition

[Figure: lane boundaries are marked in addition to the tracked detections]

Audi https://www.slashgear.com/man-vs-machine-my-rematch-against-audis-new-self-driving-rs-7-21415540/

BMW + Intel https://newsroom.intel.com/news-releases/bmw-group-intel-mobileye-will-autonomous-test-vehicles-roads-second-half-2017/

Ford http://cwc.ucsd.edu/content/connected-cars-long-road-autonomous-vehicles

Today's autonomous cars require a lot of computing hardware!

…and perception is the most computationally-intensive part of the software stack

Big computers = expensive cars

As a workaround, companies want people to share autonomous vehicles to amortize hardware costs

Shared autonomous vehicles will likely have some of the same downsides as public transportation

Will better computer chips make autonomous cars affordable?

Deep Learning Processors have arrived!

[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)

THE SERVER SIDE

| Platform | Computation (GFLOP/s) | Memory Bandwidth (GB/s) | Computation-to-bandwidth ratio | Power (TDP, Watts) | Year |
|---|---|---|---|---|---|
| NVIDIA K20 [1] | 3,500 (32-bit float) | 208 (GDDR5) | 17 | 225 | 2012 |
| NVIDIA V100 [2] | 112,000 (16-bit float) | 900 (HBM2) | 124 (yikes!) | 250 | 2018 |

Uh-oh… Processors are improving much faster than Memory.

Deep Learning Processors have arrived!

[1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf
[2] https://www.androidauthority.com/huawei-announces-kirin-970-797788
[3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/
[4] https://developer.nvidia.com/jetson-xavier

MOBILE PLATFORMS

| Device | Cores | Computation (GFLOP/s) | Memory Bandwidth (GB/s) | Computation-to-bandwidth ratio | System Power (TDP, Watts) | Year |
|---|---|---|---|---|---|---|
| Samsung Galaxy Note 3 | Arm Mali T-628 GPU [1] | 120 (32-bit float) | 12.8 (LPDDR3) | 9.3 | ~10 | 2013 |
| Huawei P20 | Kirin 970 NPU [2] | 1,920 (16-bit float) | 30 (LPDDR4X) | 64 (ouch!) | ~10 | 2018 |
| NVIDIA Jetson Xavier [3,4] | NVIDIA Tensor Cores | 30,000 (8-bit int) | 137 | 218 (yikes!) | 10 to 30 (multiple modes) | 2018 |

What will the next generation Deep Learning servers look like?

https://medium.com/@shan.tang.g/a-list-of-chip-ip-for-deep-learning-48d05f1759ae

What will the next generation Deep Learning servers look like? 20 TOP/W COMPUTATION

| Platform | Efficiency (TOP/s/W) | Computation (TOP/s) | Memory Bandwidth (TB/s) | Computation-to-bandwidth ratio | Power (TDP, Watts) | Year |
|---|---|---|---|---|---|---|
| NVIDIA K20 [1] | 0.015 | 3.50 (32-bit float) | 0.208 (GDDR5) | 17 | 225 | 2012 |
| NVIDIA V100 [2] | 0.45 | 112 (16-bit float) | 0.900 (HBM2) | 124 | 250 | 2018 |
| Next-gen: 20 TOP/W | 20 | 2,500* | 1.800 (HBM3) [3] | 1,389 (oh no!) | 250 | 2020 (est.) |

[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)
[3] https://www.eteknix.com/gddr6-hbm3-details-emerge/

* Assuming half the power is spent on computation and the other half on memory and other devices: 20 TOP/s/W * 250 W * 0.5 = 2,500 TOP/s
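To make the compute-vs-memory gap concrete, here is a small Python sketch (using only the numbers from the tables above) that recomputes the computation-to-bandwidth ratio for each generation:

```python
# Computation-to-bandwidth ratio = peak compute / peak memory bandwidth,
# i.e. how many operations the chip can issue per byte it can fetch.
# Numbers are taken from the tables above.
platforms = {
    "NVIDIA K20 (2012)": (3.5, 0.208),                 # (TOP/s, TB/s)
    "NVIDIA V100 (2018)": (112.0, 0.900),
    "Next-gen 20 TOP/W (est. 2020)": (2500.0, 1.800),
}

for name, (tops, tbps) in platforms.items():
    print(f"{name}: {tops / tbps:.0f} ops per byte")
# K20 ~17, V100 ~124, next-gen ~1389: unless the DNN performs that many
# operations for every byte it moves, the accelerator stalls on memory.
```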

Small Neural Nets to the rescue

squeeze (verb): to make an AI system use fewer resources using whatever means necessary

What to squeeze:
• Memory footprint and bandwidth
• Computational operations
• Power and energy
• Time

How to squeeze:
• New DNN models
• Application-specific quantization and pruning
• Superior implementations
• Differentiated data and training strategies

Most CV Applications Rely on Only a Few Core CV Capabilities

Image Classification

Object Detection

Semantic Segmentation

And the best accuracy for each of these capabilities is given by Convolutional Neural Nets

But We Need a Very Different Kind of DNN

DGX-1: 170 TFLOPS, 3.2 kW, 128 GB memory

TitanX: 11 TFLOPS, 223 Watts, 12 GB memory

VGG16 [1] model:
• Parameter size: 552 MB
• Memory: 93 MB/image
• Computation: 15.8 GFLOPs/image

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
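As a quick sanity check on the 552 MB figure, a back-of-the-envelope Python sketch, assuming the commonly cited ~138M parameters for VGG16 (the parameter count is from the VGG paper, not this deck) stored as 32-bit floats:

```python
# VGG16 weight footprint: ~138M parameters (Simonyan & Zisserman) at 4 bytes each.
num_params = 138_000_000
bytes_per_param = 4          # 32-bit float
print(f"~{num_params * bytes_per_param / 1e6:.0f} MB")   # ~552 MB, matching the slide
```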

Smartphones: 100s of GFLOP/s, ~3 Watts, 2-4 GB memory

IoT devices: 100s of MHz, <1 Watt, <1 GB memory

Speed Is More Related to Memory Accesses than to Operations

Memory hierarchy of the Samsung Exynos M1 (Galaxy S7):

| | L1 D-Cache (per core) | L2 Cache (shared) | Off-chip DRAM |
|---|---|---|---|
| Size | 32 KB | 2 MB | 4 GB |
| Read Latency | 4 cycles | 22 cycles | ~200 cycles |
| Read Bandwidth | 20.8 GB/s | 166.4 GB/s | 28.7 GB/s |

Energy Is More Related to Memory Accesses than to Operations (45 nm, 0.9 V)

[Figure: energy per operation (pJ) for an 8b INT multiply, 16b FP multiply, 32b FP multiply, 64b cache reads (32 KB and 1 MB caches), and DRAM access. Annotated ratios: 5.5x, 18.5x, 100x, 500x, and 10,000x; a DRAM access costs roughly 10,000x the energy of an 8-bit integer multiply.]

Mark Horowitz, “Computing’s Energy Problem (and what we can do about it),” ISSCC 2014

10,000 DNN Architectural Configurations Later: SqueezeNet (2016)

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
[2] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size." arXiv:1602.07360 (February 2016).

| CNN | ImageNet Top-5 Accuracy | Model Parameters | Model Size |
|---|---|---|---|
| AlexNet [1] | 80.3% | 60M | 243 MB |
| SqueezeNet [2] | 80.3% | 1.2M | 4.8 MB (compresses to 500 KB) |
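For reference, a minimal PyTorch sketch of SqueezeNet's Fire module, the building block behind these parameter savings: a 1x1 "squeeze" convolution cuts the channel count before parallel 1x1 and 3x3 "expand" convolutions. The channel counts below are illustrative (they roughly match an early Fire module in the paper):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire module: squeeze with 1x1 convs, then expand
    with parallel 1x1 and 3x3 convs and concatenate the results."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))            # fewer channels -> fewer 3x3 params
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Illustrative usage: 96 input channels squeezed to 16, expanded back to 64+64.
fire = Fire(in_ch=96, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)
out = fire(torch.randn(1, 96, 55, 55))            # -> shape (1, 128, 55, 55)
```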

SqueezeNet: Immediate Success in Embedded Vision

• Enabled embedded processor vendors (ARM, NXP, Qualcomm) to demo CNNs
• Quickly ported to all the major deep learning frameworks

NXP – Embedded Vision Summit

Qualcomm – Facebook F8

Apple CoreML

SqueezeDet for Object Detection (2017)

[Figure: SqueezeDet pipeline: input image → convolutional feature map → ConvDet (filtering) → bounding boxes → final detections]

Best Paper Award: Bichen Wu, Forrest Iandola, Peter H. Jin, and Kurt Keutzer. "SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving." In Proceedings, CVPR Embedded Computer Vision Workshop, July 2017.

• ~2M model parameters
• 57 FPS
• 1.4 Joules/frame
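As a rough illustration of the single-stage idea behind SqueezeDet's ConvDet layer (a sketch, not DeepScale's exact implementation): a single extra convolution maps the backbone's feature map to per-anchor box offsets, a confidence score, and class scores, so detection adds almost no parameters on top of the backbone. Sizes below are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of a SqueezeDet/ConvDet-style detection head:
# one 3x3 convolution maps the backbone feature map to per-anchor outputs.
num_anchors, num_classes = 9, 3           # e.g. car / cyclist / pedestrian
feat_channels = 512                        # channels of the backbone feature map
out_channels = num_anchors * (4 + 1 + num_classes)  # box offsets + conf + classes

convdet = nn.Conv2d(feat_channels, out_channels, kernel_size=3, padding=1)

features = torch.randn(1, feat_channels, 22, 76)     # backbone output (H x W cells)
preds = convdet(features)                             # (1, anchors*(4+1+C), 22, 76)
# Each of the 22*76 cells predicts `num_anchors` boxes; NMS then keeps the best.
```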

SqueezeSeg: Semantic Segmentation for LIDAR (2018)

LIDAR point cloud segmentation with SqueezeSegV2:
• Higher accuracy: v1 [1] 64.6% → v2 [2] 73.2% (+8.6%)
• Better Sim2Real performance: v1 [1] 30% → v2 [2] 57.4% (+27.4%)
• Outperforms v1 trained on real data w/o intensity

[1] Wu, Bichen, et al. "Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud." ICRA18 [2] Wu, Bichen, et al. "SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud." arXiv:1809.08495 (2018).

Squeeze Family
• Image Classification: SqueezeNet, SqueezeNext, ShiftNet, DiracDeltaNet, DNASNet
• Object Detection: SqueezeDet
• Semantic Segmentation: SqueezeSeg-{v1, v2}

Andrew Howard's MobileNets: Efficient On-Device Computer Vision Models

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNet V2: Inverted Residuals and Linear Bottlenecks

• Designed for efficiency on mobile phones
• Family of Pareto-optimal models to target the needs of the user
• V1 is based on depthwise separable convolutions (sketched below)
• V2 introduces inverted residuals and linear bottlenecks
• Supports classification, detection, segmentation, and more
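A minimal PyTorch sketch of the depthwise separable convolution that MobileNet V1 is built on: a per-channel 3x3 depthwise convolution followed by a 1x1 pointwise convolution. The channel counts and BatchNorm/ReLU placement below follow the usual recipe and are illustrative:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv (one filter per input channel) + 1x1 pointwise conv.
    Replaces a dense 3x3 convolution at a fraction of the parameters/FLOPs."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Parameter comparison for 256 -> 256 channels:
#   dense 3x3:      256*256*3*3 = 589,824 weights
#   depthwise sep.: 256*3*3 + 256*256 = 67,840 weights (~8.7x fewer)
block = DepthwiseSeparableConv(256, 256)
```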

Model Compression

[Figure: model compression results showing ≥50x and 10x reductions. Slide credit: Prof. Warren Gross (McGill Univ.)]

DNN Architecture Search

Anatomy of a convolution layer

[Figure: a 3x3 convolution with 384 input channels and 384 filters applied to a 13x13x384 input, producing a 13x13x384 output]

Filters: Kernel Reduction

[Figure: shrinking the 3x3 kernels to 1x1 (keeping 384 input channels and 384 filters) gives a 9x reduction in model parameters]

Filters/Channel Reduction

[Figure: reducing the channel count from 384 to 128 for both the 3x3 convolution's input and its filters gives a 9x reduction in model parameters]
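The arithmetic behind both "9x" claims fits in a few lines of Python (conv-layer parameters = input channels x filters x kernel height x kernel width, ignoring biases):

```python
# Parameters of a conv layer = in_channels * out_channels * kH * kW (ignoring bias).
def conv_params(in_ch, out_ch, k):
    return in_ch * out_ch * k * k

baseline = conv_params(384, 384, 3)         # 3x3 kernels, 384 -> 384 channels
kernel_reduced = conv_params(384, 384, 1)   # shrink 3x3 kernels to 1x1
channel_reduced = conv_params(128, 128, 3)  # shrink 384 channels to 128

print(baseline // kernel_reduced)    # 9  -> "kernel reduction": (3*3)/(1*1)
print(baseline // channel_reduced)   # 9  -> "channel reduction": (384*384)/(128*128)
```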

Model Distillation/Compression

Model Distillation

Li, et al. Mimicking Very Efficient Network for Object Detection. CVPR, 2017.
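A minimal sketch of the distillation idea in PyTorch: the small "student" is trained to match the softened output distribution of a large "teacher" in addition to the usual label loss. This is the classic logit-distillation formulation; the cited Li et al. paper mimics intermediate feature maps for detection instead, and the temperature and weighting below are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend the usual cross-entropy with a KL term that pushes the student's
    softened predictions toward the teacher's softened predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

# Illustrative usage with random tensors standing in for real model outputs.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```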

Examples of what's on a DNN Architect's Palette

• Spatial convolution (e.g. 3x3)
• Shift
• Channel shuffle (sketched below)
• Depthwise convolution
• Pointwise convolution (1x1)
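Shift and channel shuffle are examples of layers that consume (almost) no FLOPs: they only rearrange data. A minimal sketch of channel shuffle, assuming the usual group-reshape-transpose trick popularized by ShuffleNet:

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups: reshape to (N, g, C/g, H, W),
    swap the group and channel axes, then flatten back. No FLOPs, just indexing."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.randn(1, 8, 4, 4)
y = channel_shuffle(x, groups=2)   # channels reordered 0,4,1,5,2,6,3,7
```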

The Art of Small Model Design: Small Neural Nets Are Beautiful (ESWeek 2017)

The palette of an adept mobile/embedded DNN designer has grown very rich!
• Overall architecture: economize on layers while retaining accuracy
• Layer types:
  • Kernel reduction: 5x5 → 3x3 → 1x1
  • Channel reduction: e.g. Fire layer
  • Experiment with novel layer types that consume no FLOPs: shuffle, shift
• Model distillation: let big models teach smaller ones
• Apply pruning
• Tailor bit precision (aka quantization) to the target processor (a small sketch follows the citation below)

Iandola, Forrest, and Kurt Keutzer. "Small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures." In Proceedings of the Twelfth International Conference on Hardware/Software Codesign and System Synthesis Companion, p. 1. ACM, 2017. (ESWeek 2017). Also, (arXiv:1710.02759)
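As referenced in the list above, a rough sketch of magnitude pruning and uniform 8-bit quantization applied to one weight tensor; the sparsity level, bit width, and per-tensor scaling are illustrative choices, not values from the talk:

```python
import torch

w = torch.randn(384, 384, 3, 3)                  # weights of one conv layer

# Magnitude pruning: zero out the weights with the smallest absolute value.
sparsity = 0.8                                   # illustrative: drop 80% of weights
threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
pruned = torch.where(w.abs() > threshold, w, torch.zeros_like(w))

# Uniform 8-bit quantization: map floats to integers with a per-tensor scale.
scale = pruned.abs().max() / 127.0
q = torch.clamp((pruned / scale).round(), -127, 127).to(torch.int8)
dequant = q.float() * scale                      # what the DNN "sees" at inference
print(f"nonzero weights: {int((pruned != 0).sum())} / {w.numel()}")
```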

Artistic/Engineering Process of Designing a Deep Neural Net

• Manual design:
  • Each iteration to evaluate a point in the design space is very expensive
  • Exploration is limited by human imagination

Can we automate this?


DNAS: Differentiable Neural Architecture Search

Differentiable Neural Architecture Search:
• Extremely fast: 8 GPUs, 24 hours
• Can search for different conditions case-by-case
• Optimize for actual latency

Bichen Wu, Kurt Keutzer, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia
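A minimal PyTorch sketch of the "differentiable" part: each searchable layer computes a learned softmax-weighted mixture over candidate operations, so architecture parameters receive gradients alongside the weights. The candidate set here is illustrative, and the actual DNAS work samples architectures stochastically (Gumbel-softmax) and adds a measured-latency term to the loss rather than using this plain DARTS-style mixture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One searchable layer: a learnable softmax over candidate ops."""
    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=1),              # 3x3 conv
            nn.Conv2d(ch, ch, 1),                         # 1x1 conv
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),   # depthwise 3x3
            nn.Identity(),                                # skip
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))   # arch params

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Both the conv weights and `alpha` receive gradients from the task loss
# (plus, in latency-aware search, a differentiable latency penalty);
# after training, each layer keeps only its highest-alpha op.
layer = MixedOp(ch=32)
y = layer(torch.randn(2, 32, 16, 16))
```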

DNAS in context (FLOPs to normalize comparison)

[Figure: scatter plot of ImageNet top-1 accuracy (higher is good) vs FLOPs (more is bad); mark size indicates search cost, circles indicate unknown search cost]

| Model | ImageNet top-1 accuracy | FLOPs | Search cost (GPU-hrs) |
|---|---|---|---|
| NAS [1] | 74.0% | 564M | 48,000 |
| PNAS [2] | 74.2% | 588M | 6,000* |
| DARTS [3] | 73.1% | 595M | 288 |
| MobileNetV2 [4] | 71.8% | 300M | - |
| AMC [5] | 70.8% | 150M | - |
| MnasNet [6] | 74.0% | 317M | 91,000* |
| DNASNet (ours) | 74.2% | 295M | 216 |

* Estimated from the paper description

[1] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv:1707.07012 (2017).
[2] Liu, Chenxi, et al. "Progressive neural architecture search." arXiv:1712.00559 (2017).
[3] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "DARTS: Differentiable architecture search." arXiv:1806.09055 (2018).
[4] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018.
[5] He, Yihui, et al. "AMC: AutoML for model compression and acceleration on mobile devices." ECCV 2018.
[6] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." arXiv:1807.11626 (2018).

DNAS for device-aware search

| Net | Latency on iPhone X | Latency on Samsung S8 | Top-1 accuracy |
|---|---|---|---|
| DNAS-iPhoneX | 19.84 ms | 23.33 ms (20% slower) | 73.20% |
| DNAS-S8 | 27.53 ms (25% slower) | 22.12 ms | 73.27% |

• For different targeted devices, both DNASNets achieve similar accuracy.

• However, per-target DNN optimization was required

The Future: Breaking down the wall between DNN Design & Hardware Design

DNN Designers:
• Unaware of:
  • Arithmetic intensity
  • Floating point vs fixed point costs
  • Memory hierarchy and latency

NN HW Accelerator architects:
• Using outdated models: AlexNet, VGG16
• Using irrelevant datasets: MNIST, CIFAR

Key Takeaways

• Autonomous vehicles currently need thousands (or even hundreds of thousands) of dollars of computing hardware

• Processing is on a trajectory of rapid improvement (in operations-per-Watt)
  • but other aspects of the system (e.g. memory) are improving much more slowly
  • today's neural networks will be choked by slow memory on tomorrow's DNN accelerators (this is already happening and will get worse)

• Designing new (smaller) neural networks helps with all of the following:
  • making full use of next-generation computing platforms
  • reducing the hardware costs in autonomous vehicles
  • enabling lower-cost, larger-scale rollouts of autonomous vehicles