Computer Vision Algorithm Acceleration Using GPGPU · Boeing Research & Technology | GPGPU GPGPU...
Transcript of Computer Vision Algorithm Acceleration Using GPGPU · Boeing Research & Technology | GPGPU GPGPU...
Boeing Research & Technology
Engineering, Operations & Technology
Copyright © 2016 Boeing. All rights reserved.
Computer Vision Algorithm Acceleration Using GPGPU and the Tegra Processor's Unified Memory
Aaron Mosher
Boeing Research & Technology
Avionics Systems Technology
Mosher, 4/1/2016, S6141 | 1GPU Technology Conference 2016 S6141 RROI # 16-00239-EOT
MODAT Ocean Search
GPU Technology Conference Note to organizers:
This presentation has been approved for release, but still requires Boeing IPM to issue
an assignment of copyright before these slides can be disseminated on conference’s
website (expected to receive IPM approval on Monday April 4th)
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Background:
Computer Vision algorithms give us the capability to make vehicles smarter (better automation); but they are often limited by computation power available.
Mosher, 4/1/2016, S6141 | 2
~ 4 KW
DARPA Grand Challenge (2004/2005) CMU RedTeam.
shock-mounted Electronics enclosure, multiple rack-mount
computers with 5KW aux generator and cooling [1]
3-4 KW<10 W
cSWAP:
▪ Cost, Size, Weight, Power
▪ Usually focused on Thermal Dissipation (power)
“I can supply the power, but how do I dissipate the heat?”
▪ “Great idea, now how do I fit it on my Vehicle ?”
▪ Image/signal processing
▪ Deep Learning
▪ Sensor Data into actionable Decisions
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Why explore GPGPU?
Mosher, 4/1/2016, S6141 | 3
Performance vs Power vs Cost:
▪ CPU speeds not getting faster since 2005 [2],[3]; but transistor count still growing(SIMD,multi-core, etc.)
▪ Push towards concurrency, parallel operations
illustrative trend data based on:
http://cpudb.stanford.edu/visualize/clock_frequency
▪ Computationally expensive algorithms
▪ Miniaturization for deployment
▪ Traditional paths: FPGA, ASIC
▪ Long development / adaptation time
▪ “Speed costs money, how fast do you want to go”?
▪ Can GPGPU can provide “a faster and easierway to get there” ?
Laptop Computer
system on a chip/module
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Algorithms
Mosher, 4/1/2016, S6141 | 4
▪ Image Processing or Signal Processing is a good fit
▪ Image manipulation (GPU’s already good at this)
– For each pixel: do <<something>>
▪ Signal processing
– cuFFT, CUBLAS, etc.
▪ Large data-sets, same operation for each element
▪ Structure of the algorithm
▪ Structure for Parallelization?
▪ Some places have lots of branch divergence:
– while loops, if-then-else, open-ended iterations
– Leave these in CPU, utilize streaming and asyncoperations to maximize performance around them
▪ Adaptation effort vs throughput
▪ How much time do you want to spend in re-writingvs how fast does it need to run for your needs?
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
First steps:Profile and understand data flow before you start
Mosher, 4/1/2016, S6141 | 5
Function
A
Function B
Function
C
Function
D
▪ Map out a flowchart of execution flow and data dependency
▪ a flowchart of data dependency helps you design concurrency
▪ Usually software is written in a linear (imperative) style, but data-dependency graph can help you determine areas for concurrency
Starting from an existing (reference) algorithm?
▪ Profile the current run-time performance
▪ Find bottlenecks in current throughput, understand where to spend your time wisely
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Initial efforts and results
Mosher, 4/1/2016, S6141 | 6
Initial Cut at replacing functions with CUDA kernels
Structure of the algorithm led to suboptimal utilization of Kernels.
Restructured application to keep GPU pipeline full of compute tasks.
Identified further areas of optimization needed (some Kernels taking too long)
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
GPGPU Pipeline Optimization: AfterGPU Pipeline is kept full with processing, no “air gaps” where it stalls.
Mosher, 4/1/2016, S6141 | 7
In this case, most processing moved up into device (GPU) except for some processing at the end of the cycle.
Some algorithms may require intermediate steps of data movement between GPU & CPU. Use overlapped operations [6].
What is missing from this Graph?
The CPU.
Its doing nothing.
Further work: restructure algorithm to utilize both GPU and CPU concurrently.
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
MODAT adaptation results:
Mosher, 4/1/2016, S6141 | 8
Before: MFC application on Intel i7
Intel Performance Primitives
After: OpenCV + CUDA
on the Tegra X1
(click to play video)
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Mosher, 4/1/2016, S6141 | 9
Results / performance: (MODAT)
▪ Before our adaptation, this algorithm ran at:
▪ ~ 13.8 frames per second (older C/C++ algorithm using MFC) on M4800
▪ ~ 36.6 frames per second when adapted to use IPP (Intel Performance Primitives) on M4800
Watts
Compared to an optimized IPP implementation: a CUDA implementation on X1 is
5.7x improvement in performance-per-watt.
Compared to simple C++ algorithm: a CUDA implementation on X1 is
11x improvement in performance-per-watt.
~3 to 4 person-months of development effort
Throughput on TX1 approaches (but does not match) the
laptop (Dell M4800) using CUDA or IPP. However, a ~7:1
reduction in power dissipation.
After adaptation (OpenCV + CUDA) it ran at:~ 14 frames per second (TK1)
~ 26 frames per second (TX1)
~ 29 frames per second on M4800 (Quadro K2100m)
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Unified Memory
Mosher, 4/1/2016, S6141 | 10
▪ Physically combined memory on the Tegra K1 & X1
▪ Portable to other GPU platforms, data-migration happening from Host/Device but is hidden
▪ Initial testing with K1 was disappointing, slow memory access speeds (cache problem?)
▪ Testing with X1 was encouraging, as far as I can tell unified memory incurs no penalty compared to traditional Global memory.
▪ The real value of Unified Memory, in our experience, is that it removes barriers to adaptation / conversion:
▪ Ease of programming /adaptation: existing algorithms had complicated C++ classes, object oriented structures, deep-copy operations for data [4]
– Object Oriented programming is common design.
GPU:
CPU:
▪ Take advantage in conjunction with overlapped operations and CUDA streams [6]
▪ Our Ocean Search algorithm does this, ~15% speedup when implemented
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Unified Memory benchmarking:
Based on BoxFilter sample, reversing pixels per-row in the image (mirror left-right)
Mosher, 4/1/2016, S6141 | 11
▪ On K1, depends on access patterns, data movement
In these examples, “Memory Reversal” benchmark includes
both upload & download times
GFLOP/s are an estimate and relative measurement only,
not an optimized algorithm for maximum performance
Tegra K1
Tegra X1
Faster on X1 (Shield TV and Jetson TX1).
Around 2x faster memory
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Conclusions and lessons-learned
Mosher, 4/1/2016, S6141 | 12
What we learned:
▪ Reference algorithm (CPU only): start with profiler before jumping-in to CUDA adaptation
▪ Adaptation effort vs runtime performance gained
▪ use Nsight Profiler; chose optimizations carefully to maintain development schedule
What went right:
▪ Was able to provide a computer-vision capability within a size/weight/power not previously possible.
What went wrong:
▪ Some experiments in Kernel optimization took extra development time, but yielded no appreciable runtime improvement. “80% / 20%” rule ?
Profile often to understand where to spend effort
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
References and Credits:
Mosher, 4/1/2016, S6141 | 13
[1] “A Robust Approach to High-Speed Navigation for Unrehearsed Desert Terrain” Chris Urmson, Charlie Ragusa, and David Ray, Journal of Field Robotics 23(8), 467–508 (2006)
[2] “CPU DB: Recording Microprocessor History” Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University. ACM Queue, April 6, 2012. Volume 10 issue 4.
https://queue.acm.org/detail.cfm?id=2181798
[3] http://cpudb.stanford.edu
[4] https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
[5] http://www.alexstjohn.com/WP/2014/01/16/porting-cuda-6-0/ (Alex St. John)
[6] https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
Copyright © 2016 Boeing. All rights reserved.
Boeing Research & Technology | GPGPU
Questions and Answers time:
Mosher, 4/1/2016, S6141 | 14
[1]