MIT Space Systems Laboratory SSL Seminar 0/17

Graphics Processing Unit (GPU) Acceleration of Machine Vision Software

for Space Flight Applications

Workshop on Space Flight Software

November 6, 2009

Brent Tweddle Massachusetts Institute of Technology

Space Systems Laboratory


Machine Vision in Space

•  CSA Space Vision System
•  DARPA Orbital Express AVGS
•  JSC Sprint AERCam & Mini AERCam
•  GSFC Hubble Robotic Repair
•  NRL SUMO FREND
•  JPL Mars Exploration Rovers
•  MIT SSL SPHERES


MER Driving Speeds

Flight Processor

•  20 MHz RAD6000 CPU
•  128 MB DRAM
•  VxWorks Operating System
•  Memory Space & Cache Shared by 97 other tasks

Mode                        Speed
Manual Driving              124 m/hr
AutoNav (safe terrain)      36 m/hr
AutoNav (obstacles)         10 m/hr
Visual Odometry             10 m/hr
Visual Odometry + AutoNav   5 m/hr

[1] J. J. Biesiadecki, C. Leger, and M. W. Maimone. Tradeoffs Between Directed and Autonomous Driving on the Mars Exploration Rovers. In S. Thrun, R. A. Brooks, and H. F. Durrant-Whyte, editors, ISRR, volume 28 of Springer Tracts in Advanced Robotics, pages 254–267. Springer, 2005.

[2] M. W. Maimone, A. E. Johnson, Y. Cheng, R. G. Willson, and L. Matthies. Autonomous Navigation Results from the Mars Exploration Rover (MER) Mission. In M. H. Ang Jr. and O. Khatib, editors, ISER, Springer Tracts in Advanced Robotics, pages 3–13. Springer, 2004.

A 0.5 m drive step: 13 seconds driving, 70 seconds computing
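The drive and compute times above imply an effective traverse rate. A minimal sketch of that arithmetic in C, assuming the rover simply alternates one drive segment with one stationary compute phase (the function name is illustrative and does not come from the slides):

```c
/* Effective speed in m/hr when every drive segment is followed by a
 * stationary compute phase. Illustrative helper, not flight code. */
double effective_speed_m_per_hr(double step_m, double drive_s, double compute_s)
{
    double cycle_s = drive_s + compute_s;   /* total time spent per step */
    return step_m / cycle_s * 3600.0;       /* metres per hour */
}
```

With the figures on this slide, `effective_speed_m_per_hr(0.5, 13.0, 70.0)` comes to roughly 21.7 m/hr, so most of each cycle is spent computing rather than driving.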


Overview

•  Characteristics of Vision Algorithms
    –  Parallelism and locality

•  Hardware Architecture
    –  CPU
    –  FPGA
    –  GPU

•  GPU Programming Model

•  Initial Performance Results and Comparison

•  Path to space operations

•  Conclusions


Stereo Depth Map

Figure: Left Stereo Image, Right Stereo Image, Stereo Disparity Map

Minimize Windowed Sum of Squared Differences over disparity d

Characteristics
•  2D Spatial Locality
•  Read or Write
•  Data Parallel
•  Minimal Branching & Instruction Complexity
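The disparity search named on this slide can be sketched on the CPU in plain C. This is a minimal, unoptimized illustration rather than the implementation used in the talk. It assumes 8-bit grayscale, row-major images, the convention that a left-image pixel at column x matches the right-image pixel at column x - d, and clamping at the image border; the function names are hypothetical.

```c
#include <limits.h>

/* Clamp v to the closed range [lo, hi]. */
static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Return the disparity d in [0, max_d] that minimizes the sum of squared
 * differences over a (2r+1) x (2r+1) window centered on (x, y).
 * Assumed convention: left(x, y) corresponds to right(x - d, y). */
int best_disparity(const unsigned char *left, const unsigned char *right,
                   int width, int height, int x, int y, int r, int max_d)
{
    int best_d = 0;
    long best_cost = LONG_MAX;

    for (int d = 0; d <= max_d; ++d) {
        long cost = 0;
        for (int dy = -r; dy <= r; ++dy) {
            for (int dx = -r; dx <= r; ++dx) {
                int yy = clampi(y + dy, 0, height - 1);
                int xl = clampi(x + dx, 0, width - 1);
                int xr = clampi(x + dx - d, 0, width - 1);
                int diff = (int)left[yy * width + xl]
                         - (int)right[yy * width + xr];
                cost += (long)diff * diff;
            }
        }
        if (cost < best_cost) {   /* keep the lowest-cost disparity */
            best_cost = cost;
            best_d = d;
        }
    }
    return best_d;
}
```

Every output pixel runs the same loop on its own neighborhood, reads only the two input images, and writes only its own result, which is what makes the computation data parallel with little branching.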


Cache Fundamentals

Main Memory: 0 1 2 3 4 5 6 7 8 9 A B C D E F

2 Block Cache: [4 5 6 7] [C D E F]

•  Principle of Locality
    –  Temporal Locality: Data that has been recently accessed will likely be accessed again in the future
    –  Spatial Locality: Data that is near recently accessed data will likely be accessed in the future

Matrix Data:

0 1 2 3
4 5 6 7
8 9 A B
C D E F

Cache: Smaller, faster, more expensive memory that mirrors data that is likely to be used in the future


2D Spatial Locality: Morton Mapping

Main Memory: 0 1 4 5 2 3 6 7 8 9 C D A B E F

2 Block Cache: [2 3 6 7] [A B E F]

•  2D Principle of Locality
    –  Optimized for two-dimensional applications
    –  Currently “implemented” as the texture cache in GPUs, used in machine vision applications
    –  Could be implemented on standard CPUs, but needs a remap procedure that will generate cache misses
    –  Translation from x-y to a Morton Mapping address is computationally more expensive

Matrix Data:

0 1 2 3
4 5 6 7
8 9 A B
C D E F
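The x-y to Morton address translation can be sketched in C by interleaving the bits of the two coordinates. This is a minimal illustration, assuming x is the column and y is the row, coordinates below 65536, and a square image whose side is a power of two; the function names are hypothetical, and real texture hardware may use a different tiling.

```c
#include <stdint.h>

/* Spread the low 16 bits of v so that bit i moves to bit 2*i. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton (Z-order) address: x bits in even positions, y bits in odd. */
uint32_t morton_encode(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}

/* Copy a row-major n x n image into Morton order (n a power of two).
 * This is the remap step a standard CPU would need. */
void morton_remap(const unsigned char *src, unsigned char *dst, uint32_t n)
{
    for (uint32_t row = 0; row < n; ++row)
        for (uint32_t col = 0; col < n; ++col)
            dst[morton_encode(col, row)] = src[row * n + col];
}
```

Applied to the 4x4 matrix on this slide, `morton_remap` produces the memory order 0 1 4 5 2 3 6 7 8 9 C D A B E F, so each 2x2 neighborhood lands in four consecutive addresses.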


Data Parallel Visual Navigation Algorithms

•  Estimation
    –  Particle Filter
•  2D Image Processing
    –  Disparity Map
    –  Kernel Filtering
•  3D Data Processing
    –  Iterative Closest Point
•  Path Planning
    –  Rapidly Exploring Random Trees
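Kernel filtering is a simple case of the data-parallel pattern these algorithms share. The C sketch below applies a 3x3 weighted sum (correlation form, no kernel flip) with clamped borders; the function name, the float image format, and the border policy are assumptions made for illustration.

```c
/* Apply a 3x3 filter to a row-major float image. Each output pixel depends
 * only on its 3x3 input neighborhood, so every (x, y) iteration could run
 * as an independent thread. Borders are handled by clamping coordinates. */
void filter3x3(const float *in, float *out, int width, int height,
               const float k[9])
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int j = -1; j <= 1; ++j) {
                for (int i = -1; i <= 1; ++i) {
                    int yy = y + j;
                    int xx = x + i;
                    if (yy < 0) yy = 0;
                    if (yy >= height) yy = height - 1;
                    if (xx < 0) xx = 0;
                    if (xx >= width) xx = width - 1;
                    sum += k[(j + 1) * 3 + (i + 1)] * in[yy * width + xx];
                }
            }
            out[y * width + x] = sum;
        }
    }
}
```

The two outer loops carry no dependency between iterations, which is the property a GPU exploits by assigning one thread per output pixel.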


CPU Architecture

•  Pentium 4 Willamette
    –  Released Nov 2000
    –  1.3 to 2.0 GHz
    –  256 kB cache
    –  Total Power @ 1.6 GHz: 60.8 W
    –  L1 miss: 2 cycles
    –  L2 miss: 7 cycles

[1] W. Wu, L. Jin, J. Yang, P. Liu, and S. X. D. Tan. A Systematic Method For Functional Unit Power Estimation in Microprocessors. In Design Automation Conference, 2006.


FPGA Architecture

•  Programmable logic implemented as look up tables

•  Incorporates on-chip memory and DSP blocks

•  Implemented using VHDL or Verilog to describe logic

•  Development and testing is very difficult

•  Less power efficient than a custom ASIC

Altera Stratix Look Up Table


NVIDIA GPU Architecture

Architecture Designed for Data Parallel Applications

Programming Model: Single Program, Multiple Data


GPUs for Embedded Systems

Processor                        Theoretical Peak GFLOPS   Watts   Watts per GFLOPS
Quad “Bloomfield” Xeon 3.2 GHz   25.6 GFLOPS               130 W   5.078
Core 2 Duo “Penryn” 2.53 GHz     20.2 GFLOPS               25 W    1.235
Cell Processor                   152 GFLOPS                80 W    0.526
NVIDIA Tesla C870                518 GFLOPS                170 W   0.328
NVIDIA GeForce 9800 GT           504 GFLOPS                105 W   0.208
NVIDIA GeForce 8800M GTS         240 GFLOPS                35 W    0.145

•  Assumptions:
    –  Xeon issues 2 flops per cycle per core
    –  Core2Duo issues 4 flops per cycle per core

http://icl.cs.utk.edu/hpcc/hpcc_desc.cgi?field=Theoretical%20peak
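The CPU rows of the table follow from clock rate, core count, and flops per cycle. A minimal C sketch of that arithmetic, assuming peak GFLOPS is the product of the three and that the wattage is the rated power (the function names are illustrative):

```c
/* Theoretical peak in GFLOPS: clock (GHz) x cores x flops per cycle per core. */
double peak_gflops(double clock_ghz, int cores, int flops_per_cycle)
{
    return clock_ghz * cores * flops_per_cycle;
}

/* Power cost of throughput: lower is better for embedded use. */
double watts_per_gflops(double watts, double gflops)
{
    return watts / gflops;
}
```

For the quad-core Xeon row, `peak_gflops(3.2, 4, 2)` gives 25.6 GFLOPS and `watts_per_gflops(130.0, 25.6)` gives about 5.078 W per GFLOPS.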


Mip-Mapping & Texture Cache

•  GPUs have hardware to support mapping textures onto 3D objects
    –  2D Spatial Locality
    –  High throughput
    –  Low latency
•  Data is stored as a Mip-Map in Texture Cache
    –  Hardware supports sub-pixel interpolation
    –  Morton Access Pattern

Figure: Mip-Mapped Texture, Stored Texture, Rendered Scene
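Sub-pixel interpolation can be illustrated in software with a bilinear sample, one common filter that texture units provide. The slide does not say which filter is meant, so the C sketch below is an assumption, as are the function name, the float image format, and the clamp-to-edge addressing.

```c
/* Sample a row-major float image at a fractional (x, y) position using
 * bilinear interpolation. Coordinates are clamped to the image so the
 * four neighbors are always valid. */
float sample_bilinear(const float *img, int width, int height,
                      float x, float y)
{
    if (x < 0.0f) x = 0.0f;
    if (y < 0.0f) y = 0.0f;
    if (x > (float)(width - 1)) x = (float)(width - 1);
    if (y > (float)(height - 1)) y = (float)(height - 1);

    int x0 = (int)x;                          /* x, y are non-negative here */
    int y0 = (int)y;
    int x1 = (x0 + 1 < width) ? x0 + 1 : x0;
    int y1 = (y0 + 1 < height) ? y0 + 1 : y0;
    float fx = x - (float)x0;                 /* fractional offsets */
    float fy = y - (float)y0;

    float top = img[y0 * width + x0] * (1.0f - fx) + img[y0 * width + x1] * fx;
    float bot = img[y1 * width + x0] * (1.0f - fx) + img[y1 * width + x1] * fx;
    return top * (1.0f - fy) + bot * fy;
}
```

On a GPU the equivalent weighting is done by the texture unit as part of the fetch, so a vision kernel gets the interpolated value without spending arithmetic instructions on it.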


CUDA Programming Model

•  Single Program, Multiple Data in “C”
    –  Same instruction issued to 8 threads (context & data)
•  Parallel Execution with no guarantee of order
    –  Race conditions & deadlocks are possible
    –  Synchronization and mutual exclusion are necessary
•  Direct control of on-chip memory (a global memory read takes 100s of cycles)
    –  Implement custom caching protocols

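•  Maximizing performance is challenging
    –  Aligned Memory access
    –  Resource Utilization

    // Matrix layout assumed here (not shown on the slide): row-major elements.
    typedef struct { int width; int height; float *elements; } Matrix;

    // Each thread (tx, ty) computes one element of P = M * N.
    __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
    {
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        int Mcols = M.width;
        int Ncols = N.width;

        // Dot product of row tx of M with column ty of N.
        float sum = 0;
        for (int i = 0; i < Mcols; ++i) {
            float a = M.elements[tx * Mcols + i];
            float b = N.elements[i * Ncols + ty];
            sum += a * b;
        }

        int index = tx * Ncols + ty;
        P.elements[index] = sum;
    }

O(n^3/p)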
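The Single Program, Multiple Data idea in the kernel above can be emulated on a CPU in plain C: the per-thread body is written once and invoked for every (tx, ty) pair. This is a sketch for checking results, not a CUDA launch. The `Matrix` struct layout and both function names are assumptions, and the loops run sequentially whereas a GPU gives no ordering guarantee.

```c
/* Assumed row-major matrix layout; the slide does not show the definition. */
typedef struct {
    int width;
    int height;
    float *elements;
} Matrix;

/* Work done by one logical thread: one element of P = M * N. */
static void matrix_mul_thread(Matrix M, Matrix N, Matrix P, int tx, int ty)
{
    float sum = 0.0f;
    for (int i = 0; i < M.width; ++i)
        sum += M.elements[tx * M.width + i] * N.elements[i * N.width + ty];
    P.elements[tx * N.width + ty] = sum;
}

/* Stand-in for the kernel launch: run the same body for every thread index.
 * No iteration reads another iteration's output, so any order is valid. */
void matrix_mul_emulated(Matrix M, Matrix N, Matrix P)
{
    for (int tx = 0; tx < M.height; ++tx)
        for (int ty = 0; ty < N.width; ++ty)
            matrix_mul_thread(M, N, P, tx, ty);
}
```

Each logical thread does O(n) work and there are n^2 of them, so spreading them over p processors gives the O(n^3/p) running time noted on the slide.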


Initial GPU Stereo Results

•  Implemented stereo disparity map on GPU with LR Consistency Check, based on NVIDIA original code
    –  25 ms for a 640x480 frame
•  Optimized algorithms for CPU SIMD hardware
    –  512x512: <0.1 s (Van der Mark, Gavrila, “Real-Time Dense Stereo for Intelligent Vehicles”, IEEE Trans. ITS, 2006)
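Because the two timings are for different frame sizes, it can help to normalize them to pixels per second. A small C sketch of that conversion follows; the function name is illustrative. The two figures come from different hardware and implementations, so this is not a controlled comparison.

```c
/* Throughput in megapixels per second for one frame of the given size. */
double megapixels_per_second(int width, int height, double seconds)
{
    return (double)width * (double)height / seconds / 1.0e6;
}
```

The GPU figure, `megapixels_per_second(640, 480, 0.025)`, is about 12.3 Mpixel/s. The CPU figure is quoted as under 0.1 s, so `megapixels_per_second(512, 512, 0.1)`, about 2.6 Mpixel/s, is only a lower bound on its throughput.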


Path To Space

•  Future Research and Development
    –  COTS GPU Implementation of Navigation Algorithms
        •  Do they work well in practice?
    –  Development of embedded system architectures
    –  Should we:
        •  Radiation harden a COTS GPU
        •  Or build a rad-hard GPU-like ASIC?
    –  Software testing of parallel algorithms?
    –  ESA’s architecture:
        •  Primary Flight Computer to monitor Accelerator for errors

Diagram: Primary Flight Computer, GPU Vision Accelerator


Summary & Conclusions

•  Discussed Characteristics of Machine Vision Algorithms

•  Identified need for faster and more power efficient processing architectures

•  GPU Architecture matches well with Machine Vision
    –  2D Locality: Texture Caches
    –  Data Parallel: SPMD Programming Model
    –  Minimal Branching and Instruction Complexity: Reduced control hardware

•  Initial Performance Tests show promise

•  Significant work ahead


Questions & Acknowledgements