Date posted: 22-Dec-2015

University of Michigan, Electrical Engineering and Computer Science

Power-Efficient Medical Image Processing using PUMA

Ganesh Dasika, Kevin Fan¹, Scott Mahlke

¹ Parakinetics, Inc.

University of Michigan, Advanced Computer Architecture Laboratory

The Advent of the GPGPU

• Increasingly popular substrate for HPC
  – Astrophysics
  – Weather prediction
  – EDA
  – Financial instrument pricing
  – Medical imaging

Advantages of GPGPUs

• High degree of parallelism
  – Data-level
  – Thread-level
• High bandwidth
• Commodity products
• Increasingly programmable

Disadvantages of GPGPUs

• Gap between computation and bandwidth
  – 933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 compute:mem ratio)
• Very high power consumption
  – Graphics-specific hardware
  – Multiple thread contexts
  – Large register files and memories
  – Fully general datapath

Inefficiencies in all general-purpose architectures
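
The two ratios quoted on this slide follow from the peak figures. The sketch below reproduces them, assuming (this is not stated on the slide) that one memory operand is a 4-byte, 32-bit float; the function names are illustrative only.

```python
# Peak figures quoted on the slide.
PEAK_GFLOPS = 933.0   # peak floating-point throughput, GFLOP/s
MEM_BW_GBPS = 142.0   # off-chip memory bandwidth, GB/s
WORD_BYTES = 4        # assumption: one operand is a 32-bit float


def bytes_per_flop(gflops, gbps):
    """Bytes of memory traffic available per floating-point operation."""
    return gbps / gflops


def compute_to_mem_ratio(gflops, gbps, word_bytes=WORD_BYTES):
    """Operations the chip can execute per word it can fetch."""
    words_per_sec = gbps / word_bytes   # in units of 1e9 words/s
    return gflops / words_per_sec


print(round(bytes_per_flop(PEAK_GFLOPS, MEM_BW_GBPS), 2))
print(round(compute_to_mem_ratio(PEAK_GFLOPS, MEM_BW_GBPS), 1))
```

Under that word-size assumption the ratio comes out a little above 26, matching the "~26:1" on the slide.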

Programmability vs Efficiency?

[Figure: design space with axes Flexibility and Efficiency. Labeled points: General Purpose Processors, DSPs, FPGAs, Domain-specific Accelerators, GPGPUs, Loop Accelerators, ASICs. A region marked "???" is labeled "Highly efficient, some programmability".]

Medical Image Reconstruction

• Compute-intensive loops
  – 32-bit floating point code
  – High data/bandwidth requirements
• Increased demand for portability, low power
• Much current research focuses on using GPGPUs for this domain

CT Image Reconstruction

• X-ray emitters and receptors on opposite sides of the patient
• Received X-ray intensity corresponds to tissue density
• Multiple scans (“slices”) taken around the patient are combined to reconstruct one 2D image

Projection & Sinogram

• Projection: all ray-sums in a direction
• Sinogram: all projections

[Figure: X-rays pass through the object f(x,y) in the x-y plane and produce the projection P(t); a second panel shows the sinogram, with axis labels t and p.]
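
To make the two definitions concrete, here is a minimal sketch that treats an image as a small grid and forms ray-sums along two directions only. Restricting the angles to 0° and 90° is a simplification chosen for this example (a real scanner samples many angles), and the names `project` and `sinogram` are illustrative.

```python
def project(image, angle_deg):
    """Ray-sums through a grid for an axis-aligned ray direction.

    0 degrees:  rays travel down the columns (one sum per column).
    90 degrees: rays travel along the rows (one sum per row).
    """
    if angle_deg == 0:
        return [sum(col) for col in zip(*image)]
    if angle_deg == 90:
        return [sum(row) for row in image]
    raise ValueError("this sketch only handles 0 and 90 degrees")


def sinogram(image, angles=(0, 90)):
    """Stack the projections for every angle, one row per angle."""
    return [project(image, a) for a in angles]


# A tiny 3x3 "density" image f(x, y) with a bright center.
f = [
    [0, 1, 0],
    [2, 5, 2],
    [0, 1, 0],
]
for angle, p in zip((0, 90), sinogram(f)):
    print(angle, p)
```

Each row of the result is one projection; stacking the rows over all measured angles gives the sinogram.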

Example: Backprojection

[Figure: a sinogram and the corresponding backprojected image.]

Example: Filtered Backprojection

[Figure: a filtered sinogram and the corresponding reconstructed image.]

Reconstruction: Solve for m’s

[Figure: an X-ray emitter illuminates a “human body” modeled as a 4×4 grid of unknown densities; the detectors record two sets of values, (16, 22, 11, 10) and (22, 12, 10, 15).]

Densities:

m11 m12 m13 m14
m21 m22 m23 m24
m31 m32 m33 m34
m41 m42 m43 m44
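
As a rough illustration of "solving for the m's", the sketch below applies a simple additive correction scheme (in the spirit of algebraic reconstruction) to the detector values from the figure. It assumes that (22, 12, 10, 15) are sums along the rows and (16, 22, 11, 10) are sums along the columns; the slide does not say which set is which, and this is not presented as the method used in the talk.

```python
def reconstruct(row_sums, col_sums, sweeps=10):
    """Find one grid of densities consistent with the row and column sums.

    Each pass spreads the mismatch between a measured ray-sum and the
    current estimate evenly over the cells that ray crosses.
    """
    n_rows, n_cols = len(row_sums), len(col_sums)
    m = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(sweeps):
        for i in range(n_rows):
            err = (row_sums[i] - sum(m[i])) / n_cols
            for j in range(n_cols):
                m[i][j] += err
        for j in range(n_cols):
            col_total = sum(m[i][j] for i in range(n_rows))
            err = (col_sums[j] - col_total) / n_rows
            for i in range(n_rows):
                m[i][j] += err
    return m


ROW_SUMS = [22, 12, 10, 15]   # assumed to be horizontal ray-sums
COL_SUMS = [16, 22, 11, 10]   # assumed to be vertical ray-sums

densities = reconstruct(ROW_SUMS, COL_SUMS)
for row in densities:
    print([round(v, 2) for v in row])
```

With 16 unknowns and only 8 measurements the answer is not unique; the sketch returns just one grid that matches the readings. Measuring along many more angles is what constrains the solution in practice.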

Real Reconstruction Problem

• Intensity measured
• Rays transmitted through multiple “pixels”
• Find individual “pixel” values from transmission data

[Figure: a grid of unknown pixel values ("?"), 512 values by 512 values, with sample detector readings 534, 417, 364, 555, 501, 355, 255, 712, 199; 100’s of diagonals @ 100’s of angles.]

Medical Imaging Applications

• Image reconstruction for MRI/CT/PET scans
• Large amounts of vector/thread-level parallelism
• FP-intensive kernels
  – Often requiring math library functions
• Data-intensive (~5:1 compute:mem ratio)

| Benchmark | Inner loop (% scalar/vector) | Outer-loop TLP | Compute:Mem ratio |
| Segmentation | Fully vectorizable | Do-all | 4:1 |
| Laplacian Filtering | Fully vectorizable | Do-all | 3:1 |
| Gaussian Convolution | Fully vectorizable with predicates | Do-all | 6:1 |
| MRI FH Vector | Fully vectorizable | Do-all | 6:1 |
| MRI Q Vector | Fully vectorizable | Do-all | 5.5:1 |
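
The "~5:1" figure is consistent with the per-benchmark ratios in the table. The sketch below averages them and compares the result with the ~26:1 balance quoted earlier for the GPGPU; reading that comparison as a sign that such kernels are limited by bandwidth rather than compute is an interpretation, not a number from the slides.

```python
# Compute:mem ratios as listed in the benchmark table.
RATIOS = {
    "Segmentation": 4.0,
    "Laplacian Filtering": 3.0,
    "Gaussian Convolution": 6.0,
    "MRI FH Vector": 6.0,
    "MRI Q Vector": 5.5,
}

GPU_COMPUTE_TO_MEM = 26.0   # "~26:1" quoted for the GPGPU


def mean_ratio(ratios):
    """Arithmetic mean of the per-benchmark compute:mem ratios."""
    return sum(ratios.values()) / len(ratios)


avg = mean_ratio(RATIOS)
print(round(avg, 1))                        # average ops per memory word
print(round(GPU_COMPUTE_TO_MEM / avg, 1))   # how far the GPU balance exceeds it
```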

Current Concerns: Portability/Power

• Currently, most scans require moving patient to imaging room
  – Consumes time
  – Stress on patient
• Studies show benefits of portable, bed-side scanners:
  – 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al, RSNA]
  – 80-100% drop in scan-related complications [Gunnarsson et al, J. of Neurosurgery]
• New X-Ray emitters push for mAs of current use

Current Concerns: Performance

• High-accuracy CT algorithms take too long
  – Iterative forward/backward projection
  – ~Hours on modern CT scanners instead of minutes
• Interventional radiology
  – Scans currently take minutes, but should take seconds
• CT fluoroscopy
  – Several scans done in succession

Flexibility

• Software algorithms change over time
• NRE
• Time-to-market

PUMA

• Tiled architecture
• Bandwidth-matched for improved efficiency
• Each tile is a “Programmable Loop Accelerator”

[Figure: array of tiles attached through an external interface to CPU, Mem, Disk, …]

Programmable Loop Accelerator

• Generalize accelerator without losing efficiency

[Figure: the same design space (axes Flexibility and Efficiency, Performance) with labeled points General Purpose Processors, DSPs, FPGAs, Domain-specific Accelerators, GPGPUs, Loop Accelerators, ASICs; Programmable Loop Accelerators now occupy the region previously marked "???".]

Designing Loop Accelerators

[Figure: a C code loop is translated into hardware: function units (+, &, *, <<, MEM), each with its own storage, plus local memories, a BR unit, and a CRF, joined by point-to-point connections.]

Loop Accelerator Architecture

Hardware realization of modulo scheduled loop

Parameterized hardware:
• FUs
• Shift Register Files
• Static Control
• Point-to-point Interconnect

[Figure: FUs (+, &, MEM) with local memory, CRF, and BR, joined by point-to-point connections; an FSM drives the control signals.]
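
In a modulo-scheduled loop, a new iteration is started every II cycles (the initiation interval), so the number of function units of each type limits how small II can be. The sketch below computes that resource-constrained lower bound, usually called ResMII. The operation counts and FU mix are hypothetical values chosen for illustration, and recurrence constraints (which can also raise II) are ignored.

```python
import math


def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on the initiation interval.

    op_counts: operations of each type in one loop iteration.
    fu_counts: function units available for each operation type.
    """
    bound = 1
    for op_type, n_ops in op_counts.items():
        if n_ops == 0:
            continue
        n_units = fu_counts.get(op_type, 0)
        if n_units == 0:
            raise ValueError(f"no function unit can execute {op_type!r}")
        # Each unit can start at most one such operation per cycle.
        bound = max(bound, math.ceil(n_ops / n_units))
    return bound


# Hypothetical loop body and FU mix (not taken from the slides).
loop_ops = {"add": 6, "and": 2, "mem": 4}
fus = {"add": 2, "and": 1, "mem": 2}
print(res_mii(loop_ops, fus))
```

Here the six additions on two adders dominate, so the loop cannot start iterations more often than once every three cycles; adding adders would lower the bound.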

Programmable Loop-Accelerator Architecture

| | LA | PLA |
| Functionality | Custom FU set | Generalized FUs + MOVs |
| Storage | Limited size, no addr. | Rotating Reg. Files |
| Connectivity | Point-to-point | Ring + Port-swapping |
| Control | Hardwired Control | Lit. Reg. File + Control Mem |

[Figure: PLA datapath with generalized FUs (+/-, &/|, MEM), local memory, CRF, BR, rotating register files (RR), a literals file, a ring alongside the point-to-point connections, and a control memory issuing the control signals. The LA labels +, &, SRF, and FSM also appear.]
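
The PLA swaps the LA's shift register files for rotating register files. As a minimal sketch of the general idea (not a description of the PUMA hardware itself), a rotating register file keeps values in place and shifts a base pointer instead, so a value written under one logical name in one iteration is read under the next name in the following iteration. The class and method names are illustrative.

```python
class RotatingRegisterFile:
    """Register file whose logical-to-physical mapping rotates each iteration."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0   # rotating register base

    def _physical(self, logical):
        return (logical + self.base) % len(self.regs)

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def rotate(self):
        """Advance one loop iteration: logical r[k] becomes r[k+1]."""
        self.base = (self.base - 1) % len(self.regs)


rf = RotatingRegisterFile(4)
rf.write(0, 7)      # iteration i writes r0
rf.rotate()
print(rf.read(1))   # iteration i+1 sees the same value as r1
```

No data moves on `rotate`; only the base changes, which is what makes the storage addressable while keeping the renaming that a shift register provides.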

MRI.FH PLA

• ~0.6 mm² per tile
• 38 FUs
• 128 32-bit registers
• Inter-FU BW 1 TB/sec

| FU Type | # |
| FP-ADDSUB | 6 |
| FP-MPY | 9 |
| I-ADDSUB | 8 |
| MEM | 9 |
| I-MPY | 1 |
| Other | 5 |

Performance on MRI.FH PLA

[Figure: bar chart of normalized performance (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss; series: Non-Generalized, Generalized. Bar annotations: "II preserved", "II doubled", "Unschedulable".]
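
The three annotations can be related to the bar heights with a simple model: steady-state throughput of a modulo-scheduled loop is one iteration per II cycles, so at a fixed clock, performance scales as 1/II. The sketch below encodes that reading; the II values are hypothetical, and treating an unschedulable loop as zero performance is an assumption about how the chart is drawn.

```python
def normalized_performance(native_ii, mapped_ii):
    """Throughput on a given accelerator relative to the loop's own design.

    native_ii: II achieved on hardware built for this loop.
    mapped_ii: II achieved when the loop is mapped onto another design,
               or None if no valid schedule exists.
    """
    if mapped_ii is None:       # "Unschedulable"
        return 0.0
    return native_ii / mapped_ii


# Hypothetical II values for illustration.
print(normalized_performance(4, 4))      # II preserved
print(normalized_performance(4, 8))      # II doubled
print(normalized_performance(4, None))   # unschedulable
```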

Efficiency on MRI.FH PLA

[Figure: bar chart of normalized perf/power efficiency (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss, and their mean; series: Non-Generalized, Generalized.]

PUMA System Design

• 5 systems designed around 5 benchmarks
• Each composed of identical tiles
• Assume same B/W as GTX280 (142 GB/s)
• # Tiles based on B/W requirements of benchmark

[Figure: array of tiles attached through an external interface to CPU, Mem, Disk, …]
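
One way to read "# tiles based on B/W requirements" is that tiles are added until their combined memory traffic reaches the 142 GB/s budget. The sketch below works through that sizing. The per-tile throughput, clock rate, and 4-byte word size are hypothetical placeholders, not figures from the talk, and the function names are illustrative.

```python
def tile_bandwidth_gbps(ops_per_cycle, clock_ghz, compute_to_mem, word_bytes=4):
    """Memory traffic one tile generates when running at full rate."""
    words_per_ns = ops_per_cycle * clock_ghz / compute_to_mem
    return words_per_ns * word_bytes


def tiles_for_bandwidth(system_bw_gbps, tile_bw_gbps):
    """Largest number of tiles the memory system can keep fed."""
    if tile_bw_gbps <= 0:
        raise ValueError("tile bandwidth must be positive")
    return int(system_bw_gbps // tile_bw_gbps)


SYSTEM_BW_GBPS = 142.0   # from the slide (same as GTX 280)

# Hypothetical tile: 30 ops/cycle at 0.5 GHz on a 6:1 compute:mem kernel.
per_tile = tile_bandwidth_gbps(ops_per_cycle=30, clock_ghz=0.5, compute_to_mem=6.0)
print(per_tile)
print(tiles_for_bandwidth(SYSTEM_BW_GBPS, per_tile))
```

A kernel with a lower compute:mem ratio pulls more data per operation, so fewer tiles fit in the same bandwidth budget; that is why each benchmark gets its own tile count.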

System Performance

[Figure: bar chart of GOPs/sec (0 to 160) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss; series: Theoretical, Realized. Power labels, in the order extracted: 4 W, 3 W, 2.8 W, 2.3 W, 2.7 W.]

Performance vs. GPGPU

[Figure: bar chart of TOPs/sec (0.0 to 2.0) for PUMA, GTS 250, GTX 260, GTX 280, GTX 285, and GTX 295; series: Theoretical, Realized. Annotations: "63% performance of GTX 295", "2X performance of GTS 250".]

Efficiency vs. GPGPU

[Figure: bar chart of PUMA perf/power efficiency over GPU (0 to 60) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss; series: GTS 250, GTX 260, GTX 280, GTX 285, GTX 295. Annotations: "22X", "54X".]

Conclusions

• Power-efficient accelerator for medical imaging
• ASIC-like efficiency with programmability
• 63-201% of GPU performance
• 22-54X GPU performance/power efficiency


Thank you!!

Questions?