Power-Efficient Medical Image Processing using PUMA

Ganesh Dasika, Kevin Fan¹, Scott Mahlke (University of Michigan, Electrical Engineering and Computer Science, Advanced Computer Architecture Laboratory; ¹Parakinetics, Inc.)


Page 1: Power-Efficient Medical Image Processing using PUMA

University of Michigan Electrical Engineering and Computer Science

Power-Efficient Medical Image Processing using PUMA

Ganesh Dasika, Kevin Fan¹, Scott Mahlke

¹Parakinetics, Inc.

University of Michigan Advanced Computer Architecture Laboratory

Page 2: Power-Efficient Medical Image Processing using PUMA

The Advent of the GPGPU

• Increasingly popular substrate for HPC
  – Astrophysics
  – Weather Prediction
  – EDA
  – Financial instrument pricing
  – Medical Imaging

Page 3: Power-Efficient Medical Image Processing using PUMA

Advantages of GPGPUs

• High degree of parallelism
  – Data-level
  – Thread-level
• High bandwidth
• Commodity products
• Increasingly programmable

Page 4: Power-Efficient Medical Image Processing using PUMA

Disadvantages of GPGPUs

• Gap between computation and bandwidth
  – 933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 Compute:Mem Ratio)
• Very high power consumption
  – Graphics-specific hardware
  – Multiple thread contexts
  – Large register files and memories
  – Fully general datapath

Inefficiencies in all general-purpose architectures
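The two figures on this slide follow from simple arithmetic. The sketch below reproduces them under one assumption that the slide does not state: that each memory operand is a 32-bit (4-byte) word, which is what turns 0.15 B per FLOP into a ratio of roughly 26 FLOPs per word fetched. The function names are illustrative only.

```python
# Figures quoted on the slide.
PEAK_GFLOPS = 933.0      # peak compute throughput
BANDWIDTH_GBPS = 142.0   # peak memory bandwidth
BYTES_PER_WORD = 4       # assumption: one 32-bit float per memory operand


def bytes_per_flop(gflops, gbps):
    """Bytes of memory traffic available for each floating-point operation."""
    return gbps / gflops


def compute_to_mem_ratio(gflops, gbps, bytes_per_word=BYTES_PER_WORD):
    """FLOPs that can be issued for every word the memory system delivers."""
    words_per_sec = gbps / bytes_per_word
    return gflops / words_per_sec


print(round(bytes_per_flop(PEAK_GFLOPS, BANDWIDTH_GBPS), 2))        # 0.15
print(round(compute_to_mem_ratio(PEAK_GFLOPS, BANDWIDTH_GBPS), 1))  # 26.3
```

A kernel needs about 26 operations per word loaded to keep such a device busy, far above the 3:1 to 6:1 ratios reported for the medical imaging benchmarks in this deck.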

Page 5: Power-Efficient Medical Image Processing using PUMA

Programmability vs Efficiency?

[Figure: design-space chart of Flexibility (vertical axis) against Efficiency (horizontal axis). Labeled points: General Purpose Processors; DSPs; Domain-specific Accelerators, GPGPUs; FPGAs; Loop Accelerators, ASICs. A region marked "???" is annotated "Highly efficient, some programmability".]

Page 6: Power-Efficient Medical Image Processing using PUMA

Medical Image Reconstruction

• Compute intensive loops
  – 32-bit floating point code
  – High data/bandwidth requirements
• Increased demand for portability, low power
• Much current research focuses on using GPGPUs for this domain

Page 7: Power-Efficient Medical Image Processing using PUMA

CT Image reconstruction

• X-Ray emitters and receptors on opposite sides of the patient
• Received x-ray intensity corresponds to tissue density
• Multiple scans (“slices”) taken around the patient are put together to reconstruct one 2D image

Page 8: Power-Efficient Medical Image Processing using PUMA

Projection & Sinogram

• Projection: All ray-sums in a direction
• Sinogram: All projections

[Figure: X-rays pass through the object f(x,y), drawn on x and y axes, and produce the projection P(t) along the axis t. A second panel shows the Sinogram, with axes labeled t and p.]
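The two definitions on this slide can be sketched on a tiny image. This is a minimal illustration, assuming only two viewing directions (rays running straight down the columns and straight across the rows); a real scanner uses many angles and interpolates along oblique rays. The `phantom` data and function names are made up for the example.

```python
def project_0_degrees(image):
    """Ray-sums for vertical rays: one sum per column."""
    return [sum(column) for column in zip(*image)]


def project_90_degrees(image):
    """Ray-sums for horizontal rays: one sum per row."""
    return [sum(row) for row in image]


def sinogram(image):
    """Stack the projections, one row per viewing direction."""
    return [project_0_degrees(image), project_90_degrees(image)]


# Hypothetical 4x4 "density" image with a dense 2x2 block in the middle.
phantom = [
    [0, 0, 0, 0],
    [0, 5, 5, 0],
    [0, 5, 5, 0],
    [0, 0, 0, 0],
]

print(sinogram(phantom))  # [[0, 10, 10, 0], [0, 10, 10, 0]]
```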

Page 9: Power-Efficient Medical Image Processing using PUMA

Example: Backprojection

[Figure: Sinogram (left) and Backprojected Image (right)]
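Unfiltered backprojection smears each ray-sum back along the ray that produced it. The sketch below does this for just two views (column sums and row sums), which is an assumption made to keep the example small; the slide's image is built from many angles. Names and data are illustrative.

```python
def backproject(col_sums, row_sums):
    """Spread each ray-sum evenly over the pixels its ray crossed."""
    n_rows, n_cols = len(row_sums), len(col_sums)
    image = [[0.0] * n_cols for _ in range(n_rows)]
    for y in range(n_rows):
        for x in range(n_cols):
            # The vertical ray through column x crosses n_rows pixels;
            # the horizontal ray through row y crosses n_cols pixels.
            image[y][x] = col_sums[x] / n_rows + row_sums[y] / n_cols
    return image


# Projections of a single bright pixel (value 8) at row 1, column 2.
col_sums = [0, 0, 8, 0]
row_sums = [0, 8, 0, 0]

for row in backproject(col_sums, row_sums):
    print(row)
```

The result peaks at the original pixel but also leaves streaks along its row and column. This blur is what the filtering step in filtered backprojection is meant to suppress.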

Page 10: Power-Efficient Medical Image Processing using PUMA

Example: Filtered Backprojection

[Figure: Filtered Sinogram (left) and Reconstructed Image (right)]

Page 11: Power-Efficient Medical Image Processing using PUMA

Reconstruction: Solve for m’s

“Human Body” (Densities):

m11 m12 m13 m14
m21 m22 m23 m24
m31 m32 m33 m34
m41 m42 m43 m44

[Figure: an X-Ray Emitter illuminates the 4×4 grid of unknown densities. Detector Values: 16, 22, 11, 10 along one edge of the grid and 22, 12, 10, 15 along another.]
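One way to "solve for the m's" is an iterative correction scheme in the style of the algebraic reconstruction technique (ART, a Kaczmarz iteration). The sketch below is not taken from the slides. It assumes that 16, 22, 11, 10 are the column ray-sums and 22, 12, 10, 15 are the row ray-sums (the transcript does not say which is which), and that every ray weights the cells it crosses equally.

```python
def art_reconstruct(row_sums, col_sums, sweeps=50):
    """Iteratively adjust a grid until its row and column sums match."""
    n_rows, n_cols = len(row_sums), len(col_sums)
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(sweeps):
        # Horizontal rays: share each row's residual equally among its cells.
        for i in range(n_rows):
            residual = row_sums[i] - sum(grid[i])
            for j in range(n_cols):
                grid[i][j] += residual / n_cols
        # Vertical rays: share each column's residual equally among its cells.
        for j in range(n_cols):
            residual = col_sums[j] - sum(grid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                grid[i][j] += residual / n_rows
    return grid


ROW_SUMS = [22, 12, 10, 15]   # assumed orientation of the slide's values
COL_SUMS = [16, 22, 11, 10]   # assumed orientation of the slide's values

densities = art_reconstruct(ROW_SUMS, COL_SUMS)
for row in densities:
    print([round(value, 2) for value in row])
```

Two views give only 8 equations for 16 unknowns, so this returns one grid consistent with the measurements, not a unique answer. More angles are needed to pin the densities down.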

Page 12: Power-Efficient Medical Image Processing using PUMA

Real Reconstruction Problem

• Intensity measured
• Rays transmitted through multiple “pixels”
• Find individual “pixel” values from transmission data

[Figure: grid of unknown “pixel” values (each shown as “?”), 512 values × 512 values, with measured ray-sums along its edges (534, 417, 364, 555, 501, 355, and 199 among them). Annotation: 100’s of diagonals @ 100’s of angles.]

Page 13: Power-Efficient Medical Image Processing using PUMA

Medical Imaging Applications

• Image reconstruction for MRI/CT/PET scans
• Large amounts of Vector/Thread-level parallelism
• FP-intensive kernels
  – Often requiring math library functions
• Data-intensive (~5:1 compute:mem ratio)

Benchmark              Inner-loop %Scalar/Vector            Outer-loop TLP   Compute:Mem ratio
Segmentation           Fully vectorizable                   Do-all           4:1
Laplacian Filtering    Fully vectorizable                   Do-all           3:1
Gaussian Convolution   Fully vectorizable with predicates   Do-all           6:1
MRI FH Vector          Fully vectorizable                   Do-all           6:1
MRI Q Vector           Fully vectorizable                   Do-all           5.5:1
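To make "fully vectorizable" inner loop and "do-all" outer loop concrete, here is a minimal sketch of a Laplacian filter. The deck does not give the benchmark's actual kernel, so the 5-point stencil and the untouched border are assumptions chosen for illustration.

```python
def laplacian(image):
    """Apply a 5-point Laplacian stencil to the interior of a 2D image."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    # Outer loop: every row can be processed independently (do-all).
    for y in range(1, height - 1):
        # Inner loop: the same arithmetic on every pixel, with no
        # dependence between iterations, so it maps onto vector lanes.
        for x in range(1, width - 1):
            out[y][x] = (image[y - 1][x] + image[y + 1][x]
                         + image[y][x - 1] + image[y][x + 1]
                         - 4 * image[y][x])
    return out


# A single bright pixel in the middle of a 5x5 image.
impulse = [[0] * 5 for _ in range(5)]
impulse[2][2] = 1

for row in laplacian(impulse):
    print(row)
```

Each output pixel reads five inputs and performs five arithmetic operations, so a kernel of this shape is dominated by data movement, which fits the low compute:mem ratios in the table.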

Page 14: Power-Efficient Medical Image Processing using PUMA

Current Concerns: Portability/Power

• Currently, most scans require moving patient to imaging room
  – Consumes time
  – Stress on patient
• Studies show benefits of portable, bed-side scanners:
  – 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al, RSNA]
  – 80-100% drop in scan-related complications [Gunnarsson et al, J. of Neurosurgery]
• New X-Ray emitters push for mAs of current use

Page 15: Power-Efficient Medical Image Processing using PUMA

Current Concerns: Performance

• High-accuracy CT algorithms take too long
  – Iterative forward/backward projection
  – ~Hours on modern CT scanners instead of minutes
• Interventional radiology
  – Scans currently take minutes, but should take seconds
• CT-Fluoroscopy
  – Several scans done in succession

Page 16: Power-Efficient Medical Image Processing using PUMA

Flexibility

• Software algorithms change over time
• NRE
• Time-to-market

Page 17: Power-Efficient Medical Image Processing using PUMA

PUMA

• Tiled architecture
• Bandwidth-matched for improved efficiency
• Each tile is a “Programmable Loop Accelerator”

[Figure: PUMA tiles attached to an Extern. Interface that connects to CPU, Mem, Disk, …]

Page 18: Power-Efficient Medical Image Processing using PUMA

Programmable Loop Accelerator

• Generalize accelerator without losing efficiency

[Figure: design-space chart of Flexibility (vertical axis) against Efficiency, Performance (horizontal axis). Labeled points: General Purpose Processors; DSPs; Domain-specific Accelerators, GPGPUs; FPGAs; Loop Accelerators, ASICs; Programmable Loop Accelerators; "???".]

Page 19: Power-Efficient Medical Image Processing using PUMA

Designing Loop Accelerators

[Figure: a C Code Loop is mapped to Hardware. The hardware panel shows function units (+, &, *, <<, MEM) with Local Mem, a BR unit, and a CRF, joined by Point-to-point Connections.]

Page 20: Power-Efficient Medical Image Processing using PUMA

Loop Accelerator Architecture

Hardware realization of modulo scheduled loop

Parameterized hardware:
• FUs
• Shift Register Files
• Static Control
• Point-to-point Interconnect

[Figure: FUs (+, &, MEM) with Local Mem, BR, and CRF, joined by Point-to-point Connections; an FSM issues the Control signals.]

Page 21: Power-Efficient Medical Image Processing using PUMA

Programmable Loop-Accelerator Architecture

                LA                       PLA
Functionality   Custom FU set            Generalized FUs + MOVs
Storage         Limited size, no addr.   Rotating Reg. Files
Connectivity    Point-to-point           Ring + Port-swapping
Control         Hardwired Control        Lit. Reg. File + Control Mem

[Figure: PLA datapath with FUs (+/-, &/|, MEM), Local Mem, BR, CRF, register files labeled RR, Literals, a Ring alongside the Point-to-point Connections, and a Control Memory issuing Control signals. The LA's +, & FUs, its SRFs, and its FSM are shown for comparison.]
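The "Rotating Reg. Files" entry refers to a storage scheme commonly paired with modulo-scheduled loops. The sketch below shows the general idea only: a base offset shifts every loop iteration, so a value written under one register name shows up under the next name in the following iteration. The class, its size, and its rotation direction are assumptions for illustration and do not describe PUMA's actual hardware.

```python
class RotatingRegisterFile:
    """Register file whose logical names are remapped each loop iteration."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0  # rotating base offset

    def _physical(self, logical):
        # Logical register numbers are relative to the current base.
        return (self.base + logical) % len(self.regs)

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def rotate(self):
        # Called once per loop iteration: logical r becomes logical r + 1.
        self.base = (self.base - 1) % len(self.regs)


rrf = RotatingRegisterFile(size=4)
rrf.write(0, 7)     # iteration i produces a value in r0
rrf.rotate()        # the next iteration begins
print(rrf.read(1))  # 7: the previous iteration's r0 is now visible as r1
```

Because the renaming happens in the register file, overlapped iterations can reuse the same instruction encoding without overwriting each other's live values.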

Page 22: Power-Efficient Medical Image Processing using PUMA

MRI.FH PLA

• ~0.6 mm² per tile
• 38 FUs
• 128 32-bit registers
• Inter-FU BW 1 TB/sec

FU Type     #
FP-ADDSUB   6
FP-MPY      9
I-ADDSUB    8
MEM         9
I-MPY       1
Other       5
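The FU mix above limits how quickly a modulo-scheduled loop can start new iterations. A standard lower bound on the initiation interval (II) is the resource-constrained minimum, ResMII: for each FU type, divide the operations per iteration by the number of units and round up, then take the largest result. The sketch below applies that bound to the table's FU counts. The loop's operation counts are hypothetical, and it assumes every FU accepts one operation per cycle.

```python
def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on the initiation interval."""
    bound = 1
    for fu_type, ops in op_counts.items():
        if ops == 0:
            continue
        units = fu_counts.get(fu_type, 0)
        if units == 0:
            raise ValueError(f"no {fu_type} unit available for this loop")
        bound = max(bound, -(-ops // units))  # ceiling division
    return bound


# FU counts from the MRI.FH PLA table ("Other" omitted).
MRI_FH_PLA = {"FP-ADDSUB": 6, "FP-MPY": 9, "I-ADDSUB": 8, "MEM": 9, "I-MPY": 1}

# Hypothetical operation counts for one iteration of some other loop.
hypothetical_loop = {"FP-ADDSUB": 12, "FP-MPY": 9, "I-ADDSUB": 8, "MEM": 18}

print(res_mii(hypothetical_loop, MRI_FH_PLA))  # 2
```

A loop whose operation mix matches the FU set keeps its II, while a loop that leans on a scarce FU type sees its II grow in proportion.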

Page 23: Power-Efficient Medical Image Processing using PUMA

Performance on MRI.FH PLA

[Figure: bar chart of Normalized Performance (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss, comparing Non-Generalized and Generalized designs. Annotations: II preserved; II doubled; Unschedulable.]

Page 24: Power-Efficient Medical Image Processing using PUMA

Efficiency on MRI.FH PLA

[Figure: bar chart of Normalized Perf/Power Efficiency (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss, and the mean, comparing Non-Generalized and Generalized designs.]

Page 25: Power-Efficient Medical Image Processing using PUMA

PUMA System Design

• 5 systems designed around 5 benchmarks
• Each composed of identical tiles
• Assume same B/W as GTX280 (142 GB/s)
• # Tiles based on B/W requirements of benchmark

[Figure: PUMA tiles attached to an Extern. Interface that connects to CPU, Mem, Disk, …]
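The sizing rule in the last bullet can be sketched as a division: the fixed system bandwidth is split among tiles until it is used up. The per-tile bandwidth demand below is a made-up number, since the deck does not give one, and the function name is illustrative.

```python
def tiles_for_bandwidth(system_gbps, per_tile_gbps):
    """Largest number of identical tiles the memory system can keep fed."""
    if per_tile_gbps <= 0:
        raise ValueError("per-tile bandwidth demand must be positive")
    return int(system_gbps // per_tile_gbps)


GTX280_GBPS = 142.0            # system bandwidth assumed on the slide
HYPOTHETICAL_TILE_GBPS = 4.0   # assumption: demand of one tile on one benchmark

print(tiles_for_bandwidth(GTX280_GBPS, HYPOTHETICAL_TILE_GBPS))  # 35
```

Under this rule a benchmark with a higher compute:mem ratio needs less bandwidth per tile and therefore gets more tiles, which is one reading of "bandwidth-matched".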

Page 26: Power-Efficient Medical Image Processing using PUMA

System Performance

[Figure: bar chart of GOPs/sec (0 to 160), Theoretical vs. Realized, for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss. Power labels on the chart: 4 W, 3 W, 2.8 W, 2.3 W, 2.7 W.]

Page 27: Power-Efficient Medical Image Processing using PUMA

Performance vs. GPGPU

[Figure: bar chart of TOPs/sec (0.0 to 2.0), Theoretical vs. Realized, for PUMA, GTS 250, GTX 260, GTX 280, GTX 285, and GTX 295.]

• 63% performance of GTX 295
• 2X performance of GTS 250

Page 28: Power-Efficient Medical Image Processing using PUMA

Efficiency vs. GPGPU

[Figure: bar chart of PUMA Perf/Power efficiency over GPU (0 to 60) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss, with one bar per GPU: GTS 250, GTX 260, GTX 280, GTX 285, GTX 295. Annotated values: 22X and 54X.]

Page 29: Power-Efficient Medical Image Processing using PUMA

Conclusions

• Power-efficient accelerator for medical imaging
• ASIC-like efficiency with programmability
• 63-201% of GPU performance
• 22-54X GPU Performance/Power efficiency

Page 30: Power-Efficient Medical Image Processing using PUMA


Thank you!!

Questions?