GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT...
GPUs in CT Reconstruction Logan Johnson
Agenda
• Introduction
• CT Essentials
• Forward Projection
• GPU Programming 101
• GPU Optimization of Forward Projector
This Guy
Professional • BS Bioengineering - Clemson University (2009)
• 5 years at GE Healthcare, CT Recon
• Just started at NeuroLogica, Mobile CT Recon
• Algorithm design and optimization – CPU, GPU, and Xeon Phi architectures
– CUDA and OpenCL
Unprofessional • Runner, writer, and digital artist
• Lover of “coffeine” and scotch
Glenfiddich distillery in Dufftown, Scotland
CT ESSENTIALS
What is CT?
Great for:
• 3D Imaging
• Trauma/ER
• Cardiac
• Perfusion
• Hard tissues
• Guided surgery
Biggest drawback:
• Irradiates patient (and potential use of contrast agent)
What is CT really?
https://www.youtube.com/watch?v=2CWpZKuy-NE
CT Reconstruction in a Nutshell
1. SCAN → RAW PROJECTIONS
2. CORRECT → SINOGRAM IMAGES
3. RECONSTRUCT → FBP or Iterative
Filtered Back Projection (FBP)
Fourier + Radon transform based algorithm
A CT scan is like a Radon transform of a patient. The goal is to inverse Radon transform (FBP) to recover the anatomy.
Core FBP Reconstruction Math-magics

Step        | Simplified Math
Raw View    | vout = raw scanner data
Calibration | vout = vin * gain + offsets
Beer's Law  | vout = -ln(vin / vref)
Filter      | vout = conv1D(vin, rampFilter)
Rebinning   | vout = interp2D(vin-100, vin, vin+100)

Generally, the core steps are easily parallelizable algorithms, and projections can be processed independently of one another (except rebinning).

Back Projection
The final step is Back Projection, which is also easily parallelizable but requires many projections.
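The per-sample math in the table above is simple enough to sketch directly. Here is a minimal CPU version of the calibration and Beer's-law steps; function names and parameters are illustrative, not the talk's actual code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch of two rows of the table above.
float calibrate(float vin, float gain, float offset) {
    return vin * gain + offset;        // vout = vin * gain + offsets
}

float beersLaw(float vin, float vref) {
    return -std::log(vin / vref);      // vout = -ln(vin / vref)
}

// Apply both steps to every sample of one projection view.
std::vector<float> preprocessView(const std::vector<float>& raw,
                                  float gain, float offset, float vref) {
    std::vector<float> out(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i)
        out[i] = beersLaw(calibrate(raw[i], gain, offset), vref);
    return out;
}
```

Each output sample depends only on the matching input sample, which is what makes these steps embarrassingly parallel; rebinning is the exception because it mixes neighboring samples.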
More Reasons for using GPU
Reasons:
• Off-the-shelf technology = cost savings
• Much better performance than x86/64
• Easier to program/develop than FPGA
• Floating point performance > FPGA
Draw-backs:
• Short GPU life cycle = more cost in V&V, inventory
Full-body scan of 6’ patient ready in < 5 minutes
Iterative Reconstruction
Improvements in HPC technology enable more sophisticated reconstruction algorithms
GE Veo Model Based Iterative Reconstruction (MBIR) on BladeCenter
Iterative Reconstruction
• GE Veo
• Siemens IRIS
• Siemens SAFIRE
• Philips iDose
Algorithms are generally much more complex than FBP, therefore slower
Iterative Reconstruction
You get what you compute for.
Iterative Reconstruction
Next big challenge in CT imaging for GPUs – Veo quality at SAFIRE/iDose speeds
FORWARD PROJECTION
What is Forward Projection?
SCAN → CORRECTED PROJECTIONS
Forward Project → RE-PROJECTIONS
Forward projection is like simulating a CT scan. The inputs to this simulation are CT images. Reprojections should be similar to the original corrected projections from which those input images were reconstructed.
Modeling X-Ray Transmission
−ln𝜆𝑜𝑢𝑡𝜆𝑖𝑛= 𝑎𝑖𝑙𝑖𝑖
𝜆𝑖𝑛
𝜆𝑜𝑢𝑡
Intensity, 𝜆, decreases as beam passes through object
Σ𝑜𝑢𝑡 = 𝑎𝑖𝑙𝑖𝑖
Σ𝑖𝑛
Σ𝑜𝑢𝑡
Real System FP with CT Image Input
Sum of attenuations, Σ, increases as ray passes through image
Beer (-Lambert)’s Law!
1
2
Modeling X-Ray Transmission

Σ_out = Σ_i a_i · l_i

Summing attenuation values: for each row, compute the attenuation by interpolating between pixels at the intersection with the ray. Add these to an accumulator, and multiply the result by the geometric scaling factor, l_i = h / sin(θ), since this value is constant across all rows for this particular ray.

Example (interpolating between pixel pairs 5, 1 and 3, 5):
    a_n   = 5 · 0.5 + 1 · 0.5 = 3
    a_n+1 = 3 · 0.2 + 5 · 0.8 = 4.6

Modeling X-Ray Transmission

"Walking" across just rows: just two samples?
"Walking" across rows OR columns: that's more like it.
Choose between sampling patterns with "if |cos(θ)| > |sin(θ)|", where θ is the ray angle.
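The row walk just described can be sketched as a short loop. This is a minimal CPU version assuming a row-major image; the names (`forwardProjectRay`, `dxPerRow`) are illustrative, not the talk's code. The |cos(θ)| > |sin(θ)| test decides whether rows or columns play the role of the walked axis; a column walk is the same code with the roles of x and row swapped.

```cpp
#include <cmath>
#include <vector>

// Walk across image rows, interpolate between the two pixels the ray
// crosses in each row, accumulate, then scale once by l_i = h / sin(theta).
float forwardProjectRay(const std::vector<float>& img, int width, int height,
                        float x0, float dxPerRow, float h, float theta) {
    float acc = 0.0f;
    float x = x0;                                   // ray's x at the first row
    for (int row = 0; row < height; ++row, x += dxPerRow) {
        int ix = static_cast<int>(std::floor(x));
        if (ix < 0 || ix + 1 >= width) continue;    // ray misses this row
        float w = x - ix;                           // interpolation weight
        acc += img[row * width + ix] * (1.0f - w)   // a_n = lerp of the two
             + img[row * width + ix + 1] * w;       //   pixels the ray crosses
    }
    return acc * (h / std::sin(theta));             // l_i, constant per ray
}
```

For a vertical ray at x = 0.5 through the 2×2 image {5, 1, 3, 5}, the per-row samples are 3 and 4, matching the slide's interpolation arithmetic with equal weights.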
Modeling a CT Scanner

X-Ray Source (Tube) and CT Detector, with detector rows (detector row width) and detector channels (detector channel radial width); key distances are source-to-isocenter and source-to-detector.

The X-ray source and detector rotate around the isocenter. Detector channels are equiangularly spaced w.r.t. the source. Rows are all the same width.
Modeling a CT Scanner

One rotation, 21 equally spaced views (not a realistic scan): View 0, View 1, View 2, …, View 20, spanning -180° to 180°.

Two key parameters – views (exposures) per rotation and rotation speed.
Ray Driven Cone Beam Forward Projection

One rotation per N seconds, M equally spaced views (View 0, View 1, …, View M-1). We want to compute the projection for each ray at each view location.

In-plane geometry: as the view ROTATEs in x-y, rays along the channel direction walk across IMAGE COLUMNS or IMAGE ROWS depending on angle. Out-of-plane geometry: rays also spread along the row direction in ±z. 3D Ray Tracing!

Total output elements = rows * channels * views
GPU PROGRAMMING 101
Programming GPUs
CUDA
• Compute Unified Device Architecture
• NVidia proprietary
• GPU only
• Block size and grid size

OpenCL
• Open Computing Language
• Khronos Group open standard
• AMD, NVidia, Intel, Altera, Xilinx
• GPU, CPU, Phi, FPGA, others (?)
• Global work size and work group size

Very similar paradigms, and both are C/C++ APIs. Comparing CUDA to OpenCL is like comparing Java to C++.
CUDA Programming Model
Key concepts:
• SIMT – Single Instruction Multiple Threads
• 32 threads / warp
• Threads are grouped into blocks
• One warp's worth of threads is executed in parallel per compute unit
• Each warp executes the same instruction at the same time – lock-step execution
• Branch divergence occurs when threads within a half-warp choose different logical paths
Architecture
NVidia Maxwell (GM204)
32 cores/SM for 1 warp; 298 mm² die
Memory Architecture
Access latency:
• Registers: 1 cycle
• Shared memory: 1 to 32 cycles
• Global memory: 400 to 600 cycles
Avoid global memory accesses; try to use shared memory.
Performance Optimization
• Tools: NVidia NVVP, AMD CodeXL
• Knowledge: GPU Gems, AMD/NVidia Programming Guides
• Experience
• Creativity (borderline madness)
GPU OPTIMIZATION OF FORWARD PROJECTOR
Introduction
Experimental Setup
System Configuration:
• CUDA 6.5
• NVidia K20m
• Visual Studio 2012
Projector Configuration:
• Joseph et al. 1982 projection model
• 32 rows
• 800 views per rotation
• 1 rotation per second
• 32 mm/s movement in Z
• RTK 12 CPU reference – 1473 seconds
Input: NIH-NLM Visible Human Project, Frozen Female CT scan case, 512 (x) × 512 (y) × 1784 (z) image matrix size.
The Reconstruction Toolkit (RTK) by Creatis, MGH, et al. also contains an excellent example of this algorithm in CUDA.
Performance Goal
The scanner acquires 1 rotation per second, so processing at least 1 rotation per second will ensure FP is not the pipeline bottleneck.
Have a performance goal before you begin designing! (Even if it's roughly 1400x.)
Naïve Implementation
Priorities:
• Needs to produce correct results
• Write GPU-friendly code: avoid big if conditions
Output-driven parallelism (one thread / output element): each thread walks its ray, performing trilinear interpolation and z-coordinate computation per step, then weights and writes the final result to global memory once.
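The output-driven mapping (one thread per output element, with total outputs = rows · channels · views) can be sketched as the inverse of flattening. This helper is an illustrative assumption; a real CUDA kernel would derive the triple from blockIdx/threadIdx rather than a flat index.

```cpp
// One thread per output element: map a flat thread id to its
// (view, row, channel) triple in a rows * channels * views output volume.
struct OutputIndex { int view, row, channel; };

OutputIndex unflatten(int tid, int rows, int channels) {
    OutputIndex o;
    o.channel = tid % channels;             // fastest-varying dimension
    o.row     = (tid / channels) % rows;
    o.view    = tid / (channels * rows);    // slowest-varying dimension
    return o;
}
```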
Kernel Source Code
The inner loop – executed at most 512 times!
Somewhat redundant, but good for prototyping:
• Determine if walking across rows or columns
• Compute ray change in x and y accordingly
• Compute line integral weighting

Kernel Source Code
The projection loops don't need to be inside the if condition:
1. Avoids unnecessary and costly warp divergence
2. Eliminates duplicate code
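The refactor described above can be sketched as follows (illustrative names, not the talk's kernel): compute only the per-direction parameters inside the branch, then run one shared projection loop, rather than duplicating the loop in both arms of the if.

```cpp
// Only the stride selection diverges; the loop itself is shared, so a warp
// whose threads disagree on walk direction does not execute two loop bodies.
float sumWithSharedLoop(const float* data, int n, bool walkRows,
                        int rowStride, int colStride) {
    int stride = walkRows ? rowStride : colStride;  // the only divergent bit
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)     // single loop: no duplicated code and
        acc += data[i * stride];    //   no divergent loop bodies in a warp
    return acc;
}
```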
Results of Naïve Implementation
One rotation of data, for this much of the anatomy, in a blazing 433 seconds!!!
Performance Profiling
NVidia Visual Profiler
Very basic profiling on a 7-minute application took overnight to complete. Try running a smaller but representative case.
Performance Profiling
Complete profiling took 10 minutes for 16 of 800 views. Overflow issues persist, but there is sufficient information to begin optimizing.
Guided Performance Analysis
Helpful tool to run the most relevant profiling experiments for your kernel. Took five minutes.
Register Usage
Registers/thread are mostly driven by the number of variables in the kernel (executive summary on kernel performance).

Register Usage
This function does ~30 loads and hits peak register usage. nvdisasm gives some insight into what is using up all the registers.
Register Usage
Perhaps this huge structure (~30 elements) is causing a lot of register spillage in the innermost loop?
Register Usage
Remove covertImageCoordinatesToSpace from the inner loop with algebraic factorization:
• 433 s/rotation, 67 registers → 27.8 s/rotation, 65 registers
Yet we still need 60+ registers.
Register Usage
Since we're optimizing the inner loop…
• 27.8 s/rotation, 65 registers → 16.2 s/rotation, 74 registers
Simplified calculations and introduced pitched memory (more on this later). What else changed that could have driven up register usage?
Register Usage
Went from passing the struct by value to passing a pointer to the struct. Passing big structs by value, not by reference, to a CUDA kernel is apparently a bad idea.
• 27.8 s/rotation, 65 registers
• 16.2 s/rotation, 74 registers
• 12.5 s/rotation, 48 registers
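The fix above is only a signature change; a sketch with hypothetical names (by-value kernel arguments are copied for the launch, and the large struct was observed here to inflate per-thread register usage):

```cpp
// The "~30 element" geometry struct from the earlier slide.
struct Geometry { float params[30]; };

// before: __global__ void kernel(Geometry g, ...)        // copied by value
// after:  __global__ void kernel(const Geometry* g, ...) // passed by pointer
float sampleParam(const Geometry* g, int i) {
    return g->params[i];    // dereference on use instead of copying all 30
}
```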
Occupancy
• 64 threads/block: 10.3 s/rotation
• 128 threads/block: 12.5 s/rotation
Changing block size (for this algorithm) is simple and can quickly yield improvements in device utilization. Using shared memory might make such tweaks more challenging.
Occupancy
What can we do to further improve on 63% occupancy? (Occupancy = active warps per SM / maximum warps per SM × 100%.) But is it worth it?
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Removing Expensive Instructions
IPC = Instructions Per Clock. Higher is faster. Expected values are from the CUDA Programming Guide; measured values are from my laptop's Quadro K4100M (CC 3.0).

Instruction            | Expected IPC | Measured IPC
float32 add/mply       | 6            | 5.26
float32 divide         | ?            | 3.41
float32 rsqrtf()       | 1            | 1.2
float32 1.0f/rsqrtf()  | ?            | 1.1
float32 sqrtf()        | ?            | 1.08
int32 add              | 5            | 4.01
int32 mply             | 1            | 1.09

Simple factorization removed 512 sqrt computations per thread, some less expensive multiplications, and some variables:
• 10.3 s/rotation, 48 registers per thread → 9.00 s/rotation, 39 registers per thread
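The kind of factorization described is easy to sketch (illustrative, not the talk's code): when a loop recomputes a square root whose inputs do not change across iterations, hoist it out, turning 512 sqrtf calls per thread into one.

```cpp
#include <cmath>

// Before: the sqrt is recomputed on every loop iteration.
float accumulateNaive(int steps, float dx, float dy) {
    float acc = 0.0f;
    for (int i = 0; i < steps; ++i)
        acc += std::sqrt(dx * dx + dy * dy) * i;   // sqrt every iteration
    return acc;
}

// After: the sqrt's inputs are loop-invariant, so compute it once.
float accumulateHoisted(int steps, float dx, float dy) {
    const float stepLen = std::sqrt(dx * dx + dy * dy);  // computed once
    float acc = 0.0f;
    for (int i = 0; i < steps; ++i)
        acc += stepLen * i;
    return acc;
}
```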
Assessing Our Progress
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 433
• Removed struct from loop: 27.78
• Removed clamping: 16.2
• Struct pointer: 12.5
• Block size optimization: 10.3
• sqrt removal: 8.995
About 50x faster, but still need another 10x.
First Pass Optimization
This might be a good point to profile an entire rotation.
• Where we started: 433 seconds / rotation
• Where we arrived: 9 seconds / rotation

High-level Profile for Full Rotation
• 16 projections: 0.5 s/rotation
• 800 projections: 9 s/rotation
The 16-projection experiment isn't representative of the full experiment. Why the 18x difference?
What’s different?
One rotation per N seconds, M equally spaced views; we want to compute the projection for each ray at each view location. In-plane, rays walk across IMAGE ROWS or IMAGE COLUMNS as the view rotates; out of plane, rays spread along the row direction in ±z.
Processing more views means moving further in Z and changing rotation angles.
A Little Design of Experiment
[Chart: Adjusting Total Number of Views – Execution Time [ms] and % load efficiency vs. number of views (0–500).]
[Chart: same data with normalized execution time.]
If table position and gantry angle are held constant, the number of views has the expected linear impact on performance.
A Little Design of Experiment
[Chart: Adjusting First View Location – Execution Time [ms] and % load efficiency vs. first view location (0–100 mm).]
[Chart: Adjusting Initial View Angle – Execution Time [ms] and % load efficiency vs. first view angle (0–8 radians).]
Adjusting table position or gantry angle with a fixed number of views causes performance loss. Why?
First View Location
[Chart: Adjusting First View Location – Execution Time [ms] and % load efficiency vs. first view location (0–100 mm), with first-view positions 1–6 marked along the anatomy.]
The original 16-view test case (at position 1) wasn't projecting much – many of its rays were completely outside of the image volume.
Positions 4–6 are more representative of actual performance.
Initial Rotation Angle
[Chart: Adjusting Initial View Angle – Execution Time [ms] and % load efficiency vs. first view angle (0–8 radians).]
Load efficiency and execution time vary drastically with rotation angle. nvvp suggests that we check whether our memory accesses are coalesced.
Memory Coalescing 101

 0  1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48

For each row (row 0, row 1, row 2, row 3, …), each thread will in parallel project a pixel.
• Threads will project each row in parallel.
• For row 2, the threads will collectively need to read memory elements 15, 16, 17, 18, and 19 at the same time.
• Since these elements are adjacent, the access is said to be coalesced.
• How coalesced depends on alignment, the total number of bytes read, etc.
• Best case, these elements can be read in one transaction.
Memory Not Coalescing 101

 0  1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48

For each column (…, column 3, column 4, column 5, column 6), each thread will in parallel project a pixel.
• Threads will project each column in parallel.
• For column 4, the threads will collectively need to read memory elements 11, 18, 25, 32, and 39 at the same time.
• Since these elements are NOT adjacent, the accesses are likely not coalesced.
• How uncoalesced depends on alignment, how far apart the elements are, etc.
• Worst case, these elements will be read in five transactions.
The projector rotates 360 degrees, so our accesses will have periodically bad efficiency!
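The two access patterns above reduce to a stride difference for a row-major image (function names are illustrative): a row walk touches consecutive addresses (stride 1, coalescible into one transaction), while a column walk touches addresses a full row apart (stride = width), splitting into many transactions.

```cpp
#include <vector>

// Addresses touched by n parallel threads walking along a row: adjacent.
std::vector<int> rowWalkIndices(int width, int row, int nThreads) {
    std::vector<int> idx(nThreads);
    for (int t = 0; t < nThreads; ++t)
        idx[t] = row * width + t;        // consecutive elements
    return idx;
}

// Addresses touched walking down a column: a full image row apart.
std::vector<int> columnWalkIndices(int width, int col, int nThreads) {
    std::vector<int> idx(nThreads);
    for (int t = 0; t < nThreads; ++t)
        idx[t] = t * width + col;        // elements `width` apart
    return idx;
}
```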
Some Thoughts on Design of Experiments
[Charts repeated: Adjusting First View Location and Adjusting Initial View Angle – Execution Time [ms] and % load efficiency.]
Make sure to test all key variables while optimizing, to save on embarrassment later on.
But what are we going to do about that coalescing problem?
Revisiting the sampling problem
"Walking" across just rows: just two samples?
"Walking" across rows OR columns: that's more like it.
Choose between sampling patterns with "if |cos(θ)| > |sin(θ)|", where θ is the ray angle.
Transposed Matrix
Instead of walking across columns, walk across the rows of a transposed image.
Improvement using transposed matrix
Was: 9 seconds/rotation Now: 2.97 seconds / rotation
32 registers – disabled debugging features, now 100% occupancy
Another way to deal with overflow problems is to break up the whole experiment into parts!
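The transposed-matrix idea above can be sketched as a one-time precomputation (layout and names are illustrative): build imgT once so that a walk down a column of img becomes a walk along a row of imgT, restoring stride-1 (coalescible) accesses in the kernel.

```cpp
#include <vector>

// Transpose a row-major width x height image.
std::vector<float> transpose(const std::vector<float>& img,
                             int width, int height) {
    std::vector<float> imgT(img.size());
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            imgT[c * height + r] = img[r * width + c];
    return imgT;
}
```

Column c of img is img[r*width + c] for r = 0..height-1; the same values sit at imgT[c*height + r], which are consecutive in memory.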
Tweaking Block Size
Was: 2.97 seconds/rotation. Now: 1.8 seconds/rotation.
32 registers – disabled debugging features.
Changed block dimensions from [16, 8, 1] to [16, 1, 8].
Taking Tally
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 433
• Removed struct from loop: 27.78
• Removed clamping: 16.2
• Struct pointer: 12.5
• Block size optimization: 10.3
• sqrt removal: 8.995
• Transposed matrix: 2.97
• Block size optimization: 1.8
Let's fix the first view location for the original benchmark.
Taking Correct Tally
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 733
• Removed struct from loop: 40.5
• Removed clamping: 22.5
• Struct pointer: 16.4
• sqrt removal + block size opt.: 15.7
• Transposed matrix: 3.34
• Block size optimization: 2.02
Off by 2x. What next?
Note on performance linearity: 16 views → 2.1 s/rotation; 800 views → 2.0 s/rotation.
Guided Profile Analysis
nvvp says latency is the bottleneck.

Guided Profile Analysis
Now nvvp is telling us that occupancy is the bottleneck.

Guided Profile Analysis
The profiler is giving us the runaround. Guess it doesn't know how to improve performance.

Unguided Profile Analysis
The innermost loop is essentially 3D interpolation. What can be done to accelerate these computations?
Texture Memory
• Hardware-accelerated 8-bit 2D/3D interpolation
• Morton-ordering-like schemes are used in texture hardware

Texture Memory
Texture hardware handles both the interpolation computations and boundary checking.
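A CPU sketch of what the texture unit provides for free: clamped boundary handling plus interpolation. Shown in 2D (bilinear) for brevity; the kernel's inner loop used the 3D equivalent via a texture fetch. The hardware uses low-precision fixed-point fractional weights, so this float version is only illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Boundary checking, like the texture unit's clamp addressing mode.
float fetchClamped(const std::vector<float>& img, int w, int h, int x, int y) {
    x = std::clamp(x, 0, w - 1);
    y = std::clamp(y, 0, h - 1);
    return img[y * w + x];
}

// Bilinear interpolation at a fractional coordinate, as the hardware does.
float bilinear(const std::vector<float>& img, int w, int h, float x, float y) {
    int ix = static_cast<int>(std::floor(x));
    int iy = static_cast<int>(std::floor(y));
    float fx = x - ix, fy = y - iy;            // fractional weights
    float v00 = fetchClamped(img, w, h, ix,     iy);
    float v10 = fetchClamped(img, w, h, ix + 1, iy);
    float v01 = fetchClamped(img, w, h, ix,     iy + 1);
    float v11 = fetchClamped(img, w, h, ix + 1, iy + 1);
    return (v00 * (1 - fx) + v10 * fx) * (1 - fy)
         + (v01 * (1 - fx) + v11 * fx) * fy;
}
```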
Improvement using textures
0.475 seconds/rotation; 0.429 seconds/rotation with another block size tweak; and 0.64 seconds/rotation including transfer times.
VICTORY – So Sweet
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 733
• Removed struct from loop: 40.5
• Removed clamping: 22.5
• Struct pointer: 16.4
• sqrt removal + block size opt.: 15.7
• Transposed matrix: 3.34
• Block size optimization: 2.02
• Image textures: 0.64
Verify Outputs
Difference between original and fully optimized sinogram output (reformatted views):
Same results as naïve within ±0.7%, but in 0.6 s instead of 733 seconds. Also ~2800x faster than the "reference" CPU implementation! (I think something is wrong with it.)
Further GPU Optimization Reading
• Asynchronous compute and transfer
• Shared memory
• Multiple GPUs