Boeing Research & Technology

Engineering, Operations & Technology

Copyright © 2016 Boeing. All rights reserved.

Computer Vision Algorithm Acceleration Using GPGPU and the Tegra Processor's Unified Memory

Aaron Mosher

Boeing Research & Technology

Avionics Systems Technology

Mosher, 4/1/2016, S6141 | 1

GPU Technology Conference 2016 S6141 RROI # 16-00239-EOT

MODAT Ocean Search

GPU Technology Conference note to organizers:

This presentation has been approved for release, but still requires Boeing IPM to issue an assignment of copyright before these slides can be disseminated on the conference’s website (IPM approval is expected on Monday, April 4th).


Background:

Computer vision algorithms let us make vehicles smarter (better automation), but they are often limited by the computing power available.

Figure: DARPA Grand Challenge (2004/2005), CMU Red Team. Shock-mounted electronics enclosure, multiple rack-mount computers with a 5 kW auxiliary generator and cooling [1]. Power labels on the figure: ~3-4 kW and < 10 W.

cSWAP:

▪ Cost, Size, Weight, Power

▪ Usually focused on Thermal Dissipation (power)

▪ “I can supply the power, but how do I dissipate the heat?”

▪ “Great idea, now how do I fit it on my vehicle?”

▪ Image/signal processing

▪ Deep Learning

▪ Sensor Data into actionable Decisions



Why explore GPGPU?

Performance vs Power vs Cost:

▪ CPU clock speeds have not been getting faster since 2005 [2], [3], but transistor counts are still growing (SIMD, multi-core, etc.)

▪ Push towards concurrency, parallel operations

Illustrative trend data based on: http://cpudb.stanford.edu/visualize/clock_frequency

▪ Computationally expensive algorithms

▪ Miniaturization for deployment

▪ Traditional paths: FPGA, ASIC

▪ Long development / adaptation time

▪ “Speed costs money, how fast do you want to go?”

▪ Can GPGPU provide “a faster and easier way to get there”?

Laptop Computer

system on a chip/module



Algorithms

▪ Image Processing or Signal Processing is a good fit

▪ Image manipulation (GPUs are already good at this)

– For each pixel: do <<something>>

▪ Signal processing

– cuFFT, cuBLAS, etc.

▪ Large data-sets, same operation for each element

▪ Structure of the algorithm

▪ Structure for Parallelization?

▪ Some parts of an algorithm have a lot of branch divergence:

– while loops, if-then-else, open-ended iterations

– Leave these on the CPU, and use streaming and async operations to maximize performance around them

▪ Adaptation effort vs throughput

▪ How much time do you want to spend rewriting vs. how fast does it need to run for your needs?
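The “for each pixel: do something” pattern above can be sketched on the CPU as follows. This is only an illustrative reference, not code from the MODAT algorithm: the operation (`brighten`) and the function name `brightenImage` are hypothetical, and a grayscale, row-major image layout is assumed. The point is that each output pixel depends only on its own input pixel, which is the property that makes the loop a candidate for one-thread-per-pixel execution on a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-pixel operation: add a brightness offset and saturate
// the result to the 0..255 range of an 8-bit pixel.
inline std::uint8_t brighten(std::uint8_t pixel, int offset) {
    const int v = static_cast<int>(pixel) + offset;
    return static_cast<std::uint8_t>(std::max(0, std::min(255, v)));
}

// Apply the same operation to every pixel of a grayscale, row-major image.
// No pixel reads its neighbours, so the elements can be processed in any
// order (or all at once).
std::vector<std::uint8_t> brightenImage(const std::vector<std::uint8_t>& in,
                                        std::size_t width, std::size_t height,
                                        int offset) {
    assert(in.size() == width * height);  // caller must pass a full image
    std::vector<std::uint8_t> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [offset](std::uint8_t p) { return brighten(p, offset); });
    return out;
}
```

Calling `brightenImage(img, width, height, 10)` returns a new image and leaves the input untouched. In a CUDA port, the body of `brighten` would be the natural candidate for the kernel body.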



First steps: Profile and understand data flow before you start

[Flowchart: Function A, Function B, Function C, Function D]

▪ Map out a flowchart of execution flow and data dependency

▪ A data-dependency flowchart helps you design for concurrency

▪ Software is usually written in a linear (imperative) style, but a data-dependency graph shows which steps can run concurrently

Starting from an existing (reference) algorithm?

▪ Profile the current run-time performance

▪ Find bottlenecks in current throughput, understand where to spend your time wisely
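One way to turn a data-dependency flowchart into a concurrency plan is to assign each function to the earliest “stage” in which all of its inputs are available; functions that land in the same stage do not depend on each other. The sketch below assumes the dependencies are given as index lists and that the graph is acyclic. The function name `concurrencyStages` and the example edges are hypothetical, since the slide's flowchart arrows are not recoverable from the transcript.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// deps[i] lists the indices of the functions whose output function i consumes.
// Returns the earliest stage for each function: stage 0 for functions with no
// inputs, otherwise 1 + the largest stage among its inputs. Functions sharing
// a stage have no dependency on each other. If the graph contains a cycle,
// the functions on it keep the value -1.
std::vector<int> concurrencyStages(const std::vector<std::vector<int>>& deps) {
    const std::size_t n = deps.size();
    std::vector<int> stage(n, -1);
    std::size_t resolved = 0;
    while (resolved < n) {
        const std::size_t before = resolved;
        for (std::size_t i = 0; i < n; ++i) {
            if (stage[i] != -1) continue;  // already placed
            int s = 0;
            bool ready = true;
            for (int d : deps[i]) {
                assert(d >= 0 && static_cast<std::size_t>(d) < n);
                if (stage[d] == -1) { ready = false; break; }
                if (stage[d] + 1 > s) s = stage[d] + 1;
            }
            if (ready) { stage[i] = s; ++resolved; }
        }
        if (resolved == before) break;  // no progress: cycle detected
    }
    return stage;
}
```

For a hypothetical graph where B and C both consume A's output and D consumes both B and C, `concurrencyStages({{}, {0}, {0}, {1, 2}})` puts B and C in the same stage, which marks them as candidates for concurrent execution (for example, in separate CUDA streams).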



Initial efforts and results

Initial cut at replacing functions with CUDA kernels.

The structure of the algorithm led to suboptimal utilization of the kernels.

Restructured the application to keep the GPU pipeline full of compute tasks.

Identified further areas needing optimization (some kernels were taking too long).



GPGPU Pipeline Optimization: After

The GPU pipeline is kept full with processing, with no “air gaps” where it stalls.

In this case, most processing moved up into device (GPU) except for some processing at the end of the cycle.

Some algorithms may require intermediate steps of data movement between GPU & CPU. Use overlapped operations [6].

What is missing from this Graph?

The CPU.

It's doing nothing.

Further work: restructure algorithm to utilize both GPU and CPU concurrently.



MODAT adaptation results:

Before: MFC application on Intel i7, Intel Performance Primitives

After: OpenCV + CUDA on the Tegra X1

[Video]



Results / performance: (MODAT)

▪ Before our adaptation, this algorithm ran at:

– ~13.8 frames per second (older C/C++ algorithm using MFC) on M4800

– ~36.6 frames per second when adapted to use IPP (Intel Performance Primitives) on M4800

▪ After adaptation (OpenCV + CUDA), it ran at:

– ~14 frames per second (TK1)

– ~26 frames per second (TX1)

– ~29 frames per second on M4800 (Quadro K2100M)

[Chart axis: Watts]

▪ Compared to an optimized IPP implementation, the CUDA implementation on X1 is a 5.7x improvement in performance-per-watt.

▪ Compared to the simple C++ algorithm, the CUDA implementation on X1 is an 11x improvement in performance-per-watt.

▪ Throughput on TX1 approaches (but does not match) the laptop (Dell M4800) using CUDA or IPP, with a ~7:1 reduction in power dissipation.

▪ ~3 to 4 person-months of development effort
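The performance-per-watt comparisons on this slide follow the usual definition: frames per second divided by power draw, then the ratio of the new platform to the baseline. A minimal sketch of that arithmetic is below. The function names are hypothetical, and the measured wattages from the slide's chart are not recoverable from this transcript, so no wattage values are filled in here.

```cpp
#include <cassert>

// Performance-per-watt in frames per second per watt.
inline double perfPerWatt(double fps, double watts) {
    assert(watts > 0.0);  // power draw must be positive
    return fps / watts;
}

// Ratio of the new platform's performance-per-watt to the baseline's.
// A result above 1.0 means the new platform does more work per watt.
inline double perfPerWattGain(double fpsNew, double wattsNew,
                              double fpsBase, double wattsBase) {
    assert(fpsBase > 0.0);  // avoid dividing by a zero baseline
    return perfPerWatt(fpsNew, wattsNew) / perfPerWatt(fpsBase, wattsBase);
}
```

At equal power the gain reduces to the throughput ratio alone, which for ~26 fps (TX1) against ~36.6 fps (IPP on M4800) is about 0.71. The improvements reported above therefore come from the power term, not from raw frame rate.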



Unified Memory

▪ Physically combined memory on the Tegra K1 & X1

▪ Portable to other GPU platforms, where data migration between host and device still happens but is hidden from the programmer

▪ Initial testing with K1 was disappointing: slow memory access speeds (cache problem?)

▪ Testing with X1 was encouraging: as far as I can tell, unified memory incurs no penalty compared to traditional global memory.

▪ The real value of Unified Memory, in our experience, is that it removes barriers to adaptation / conversion:

▪ Ease of programming / adaptation: existing algorithms had complicated C++ classes, object-oriented structures, and deep-copy operations for data [4]

– Object-oriented programming is a common design.

[Diagram labels: GPU, CPU]

▪ Take advantage in conjunction with overlapped operations and CUDA streams [6]

▪ Our Ocean Search algorithm does this, ~15% speedup when implemented



Unified Memory benchmarking:

Based on BoxFilter sample, reversing pixels per-row in the image (mirror left-right)

▪ On K1, performance depends on access patterns and data movement

In these examples, the “Memory Reversal” benchmark includes both upload and download times.

GFLOP/s figures are an estimate and a relative measurement only; this is not an algorithm optimized for maximum performance.

[Charts: Tegra K1, Tegra X1]

Faster on X1 (Shield TV and Jetson TX1): around 2x faster memory.
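For readers unfamiliar with the benchmark operation, the left-right mirror amounts to reversing the pixels within each row. The sketch below is a plain CPU reference for that operation, not the benchmarked CUDA kernel or the BoxFilter sample code. The function name `mirrorRows` and the interleaved, row-major byte layout are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirror a row-major image left-right, in place, by reversing the pixels of
// each row. `channels` is the number of interleaved bytes per pixel
// (1 = grayscale, 4 = RGBA). Rows are independent of each other, so each row
// (or each pixel pair) could be handled by a separate GPU thread.
void mirrorRows(std::vector<std::uint8_t>& img, std::size_t width,
                std::size_t height, std::size_t channels = 1) {
    assert(img.size() == width * height * channels);
    for (std::size_t y = 0; y < height; ++y) {
        std::uint8_t* row = img.data() + y * width * channels;
        for (std::size_t x = 0; x < width / 2; ++x) {
            // Swap pixel x with its mirror-image pixel (width - 1 - x).
            std::swap_ranges(row + x * channels, row + (x + 1) * channels,
                             row + (width - 1 - x) * channels);
        }
    }
}
```

The operation does almost no arithmetic per byte moved, so it mostly exercises memory access, which is consistent with using it as a memory benchmark.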



Conclusions and lessons-learned

What we learned:

▪ Reference algorithm (CPU only): start with a profiler before jumping into CUDA adaptation

▪ Weigh adaptation effort against the runtime performance gained

▪ Use the Nsight profiler; choose optimizations carefully to maintain the development schedule

What went right:

▪ We were able to provide a computer-vision capability within a size/weight/power envelope not previously possible.

What went wrong:

▪ Some experiments in kernel optimization took extra development time but yielded no appreciable runtime improvement. The “80% / 20%” rule?

Profile often to understand where to spend effort



References and Credits:

[1] “A Robust Approach to High-Speed Navigation for Unrehearsed Desert Terrain” Chris Urmson, Charlie Ragusa, and David Ray, Journal of Field Robotics 23(8), 467–508 (2006)

[2] “CPU DB: Recording Microprocessor History” Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University. ACM Queue, April 6, 2012. Volume 10 issue 4.

https://queue.acm.org/detail.cfm?id=2181798

[3] http://cpudb.stanford.edu

[4] https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

[5] http://www.alexstjohn.com/WP/2014/01/16/porting-cuda-6-0/ (Alex St. John)

[6] https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/



Questions and Answers time:

[1]
