GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT...
GPUs in CT Reconstruction Logan Johnson
Agenda
• Introduction
• CT Essentials
• Forward Projection
• GPU Programming 101
• GPU Optimization of Forward Projector
This Guy
Professional • BS Bioengineering - Clemson University (2009)
• 5 years at GE Healthcare, CT Recon
• Just started at NeuroLogica, Mobile CT Recon
• Algorithm design and optimization – CPU, GPU, and Xeon Phi architectures
– CUDA and OpenCL
Unprofessional • Runner, writer, and digital artist
• Lover of “coffeine” and scotch
Glenfiddich distillery in Dufftown, Scotland
CT ESSENTIALS
What is CT?
Great for:
• 3D Imaging
• Trauma/ER
• Cardiac
• Perfusion
• Hard tissues
• Guided surgery
Biggest drawback:
• Irradiates patient (and potential use of contrast agent)
What is CT really?
https://www.youtube.com/watch?v=2CWpZKuy-NE
CT Reconstruction in a Nutshell
1. SCAN → RAW PROJECTIONS
2. CORRECT → SINOGRAM IMAGES
3. RECONSTRUCT → FBP or Iterative
Filtered Back Projection (FBP)
Fourier + Radon transform based algorithm
A CT scan is like a Radon transform of a patient. The goal is to inverse Radon transform (FBP) to recover the anatomy.
Core FBP Reconstruction Math-magics

Step        | Simplified Math
Raw View    | vout = raw scanner data
Calibration | vout = vin * gain + offsets
Beer's Law  | vout = -ln(vin / vref)
Filter      | vout = conv1D(vin, rampFilter)
Rebinning   | vout = interp2D(vin-100, vin, vin+100)

Generally, the core steps are easily parallelizable algorithms, and projections can be processed independently of one another (except rebinning).

Back Projection
The final step is Back Projection, which is also easily parallelizable but requires many projections.
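The per-sample math in the table above is simple enough to sketch directly. Here is a minimal CPU version of the calibration and Beer's-law steps; function names and parameters are illustrative, not the talk's actual code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch of two rows of the table above.
float calibrate(float vin, float gain, float offset) {
    return vin * gain + offset;        // vout = vin * gain + offsets
}

float beersLaw(float vin, float vref) {
    return -std::log(vin / vref);      // vout = -ln(vin / vref)
}

// Apply both steps to every sample of one projection view.
std::vector<float> preprocessView(const std::vector<float>& raw,
                                  float gain, float offset, float vref) {
    std::vector<float> out(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i)
        out[i] = beersLaw(calibrate(raw[i], gain, offset), vref);
    return out;
}
```

Each output sample depends only on the matching input sample, which is what makes these steps embarrassingly parallel; rebinning is the exception because it mixes neighboring samples.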
More Reasons for using GPU
Reasons:
• Off-the-shelf technology = cost savings
• Much better performance than x86/64
• Easier to program/develop than FPGA
• Floating point performance > FPGA
Draw-backs:
• Short GPU life cycle = more cost in V&V, inventory
Full-body scan of 6’ patient ready in < 5 minutes
Iterative Reconstruction
Improvements in HPC technology enable more sophisticated reconstruction algorithms
GE Veo Model Based Iterative Reconstruction (MBIR) on BladeCenter
Iterative Reconstruction
• GE Veo
• Siemens IRIS
• Siemens SAFIRE
• Philips iDose
Algorithms are generally much more complex than FBP, therefore slower
Iterative Reconstruction
You get what you compute for.
Iterative Reconstruction
Next big challenge in CT imaging for GPUs – Veo quality at SAFIRE/iDose speeds
FORWARD PROJECTION
What is Forward Projection?
SCAN → CORRECTED PROJECTIONS
Forward Project → RE-PROJECTIONS
Forward projection is like simulating a CT scan. The inputs to this simulation are CT images. Reprojections should be similar to the original corrected projections from which those input images were reconstructed.
Modeling X-Ray Transmission
−ln𝜆𝑜𝑢𝑡𝜆𝑖𝑛= 𝑎𝑖𝑙𝑖𝑖
𝜆𝑖𝑛
𝜆𝑜𝑢𝑡
Intensity, 𝜆, decreases as beam passes through object
Σ𝑜𝑢𝑡 = 𝑎𝑖𝑙𝑖𝑖
Σ𝑖𝑛
Σ𝑜𝑢𝑡
Real System FP with CT Image Input
Sum of attenuations, Σ, increases as ray passes through image
Beer (-Lambert)’s Law!
1
2
Modeling X-Ray Transmission

Σ_out = Σ_i a_i · l_i

Summing attenuation values: for each row, compute the attenuation by interpolating between pixels at the intersection with the ray. Add these to an accumulator, and multiply the result by the geometric scaling factor, l_i = h / sin(θ), since this value is constant across all rows for this particular ray.

Example (interpolating between pixel pairs 5, 1 and 3, 5):
    a_n   = 5 · 0.5 + 1 · 0.5 = 3
    a_n+1 = 3 · 0.2 + 5 · 0.8 = 4.6

Modeling X-Ray Transmission

"Walking" across just rows: just two samples?
"Walking" across rows OR columns: that's more like it.
Choose between sampling patterns with "if |cos(θ)| > |sin(θ)|", where θ is the ray angle.
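The row walk just described can be sketched as a short loop. This is a minimal CPU version assuming a row-major image; the names (`forwardProjectRay`, `dxPerRow`) are illustrative, not the talk's code. The |cos(θ)| > |sin(θ)| test decides whether rows or columns play the role of the walked axis; a column walk is the same code with the roles of x and row swapped.

```cpp
#include <cmath>
#include <vector>

// Walk across image rows, interpolate between the two pixels the ray
// crosses in each row, accumulate, then scale once by l_i = h / sin(theta).
float forwardProjectRay(const std::vector<float>& img, int width, int height,
                        float x0, float dxPerRow, float h, float theta) {
    float acc = 0.0f;
    float x = x0;                                   // ray's x at the first row
    for (int row = 0; row < height; ++row, x += dxPerRow) {
        int ix = static_cast<int>(std::floor(x));
        if (ix < 0 || ix + 1 >= width) continue;    // ray misses this row
        float w = x - ix;                           // interpolation weight
        acc += img[row * width + ix] * (1.0f - w)   // a_n = lerp of the two
             + img[row * width + ix + 1] * w;       //   pixels the ray crosses
    }
    return acc * (h / std::sin(theta));             // l_i, constant per ray
}
```

For a vertical ray at x = 0.5 through the 2×2 image {5, 1, 3, 5}, the per-row samples are 3 and 4, matching the slide's interpolation arithmetic with equal weights.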
Modeling a CT Scanner

X-Ray Source (Tube) and CT Detector, with detector rows (detector row width) and detector channels (detector channel radial width); key distances are source-to-isocenter and source-to-detector.

The X-ray source and detector rotate around the isocenter. Detector channels are equiangularly spaced w.r.t. the source. Rows are all the same width.
Modeling a CT Scanner

One rotation, 21 equally spaced views (not a realistic scan): View 0, View 1, View 2, …, View 20, spanning -180° to 180°.

Two key parameters – views (exposures) per rotation and rotation speed.
Ray Driven Cone Beam Forward Projection

One rotation per N seconds, M equally spaced views (View 0, View 1, …, View M-1). We want to compute the projection for each ray at each view location.

In-plane geometry: as the view ROTATEs in x-y, rays along the channel direction walk across IMAGE COLUMNS or IMAGE ROWS depending on angle. Out-of-plane geometry: rays also spread along the row direction in ±z. 3D Ray Tracing!

Total output elements = rows * channels * views
GPU PROGRAMMING 101
Programming GPUs
CUDA
• Compute Unified Device Architecture
• NVidia proprietary
• GPU only
• Block size and grid size

OpenCL
• Open Computing Language
• Khronos Group open standard
• AMD, NVidia, Intel, Altera, Xilinx
• GPU, CPU, Phi, FPGA, others (?)
• Global work size and work group size

Very similar paradigms, and both are C/C++ APIs. Comparing CUDA to OpenCL is like comparing Java to C++.
CUDA Programming Model
Key concepts:
• SIMT – Single Instruction Multiple Threads
• 32 threads / warp
• Threads are grouped into blocks
• One warp's worth of threads is executed in parallel per compute unit
• Each warp executes the same instruction at the same time – lock-step execution
• Branch divergence occurs when threads within a half-warp choose different logical paths
Architecture
NVidia Maxwell (GM204)
32 cores/SM for 1 warp; 298 mm² die
Memory Architecture
Access latency:
• Registers: 1 cycle
• Shared memory: 1 to 32 cycles
• Global memory: 400 to 600 cycles
Avoid global memory accesses; try to use shared memory.
Performance Optimization
• Tools: NVidia NVVP, AMD CodeXL
• Knowledge: GPU Gems, AMD/NVidia Programming Guides
• Experience
• Creativity (borderline madness)
GPU OPTIMIZATION OF FORWARD PROJECTOR
Introduction
Experimental Setup
System Configuration:
• CUDA 6.5
• NVidia K20m
• Visual Studio 2012
Projector Configuration:
• Joseph et al. 1982 projection model
• 32 rows
• 800 views per rotation
• 1 rotation per second
• 32 mm/s movement in Z
• RTK 12 CPU reference – 1473 seconds
Input: NIH-NLM Visible Human Project, Frozen Female CT scan case, 512 (x) × 512 (y) × 1784 (z) image matrix size.
The Reconstruction Toolkit (RTK) by Creatis, MGH, et al. also contains an excellent example of this algorithm in CUDA.
Performance Goal
The scanner acquires 1 rotation per second, so processing at least 1 rotation per second will ensure FP is not the pipeline bottleneck.
Have a performance goal before you begin designing! (Even if it's roughly 1400x.)
Naïve Implementation
Priorities:
• Needs to produce correct results
• Write GPU-friendly code: avoid big if conditions
Output-driven parallelism (one thread / output element): each thread walks its ray, performing trilinear interpolation and z-coordinate computation per step, then weights and writes the final result to global memory once.
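The output-driven mapping (one thread per output element, with total outputs = rows · channels · views) can be sketched as the inverse of flattening. This helper is an illustrative assumption; a real CUDA kernel would derive the triple from blockIdx/threadIdx rather than a flat index.

```cpp
// One thread per output element: map a flat thread id to its
// (view, row, channel) triple in a rows * channels * views output volume.
struct OutputIndex { int view, row, channel; };

OutputIndex unflatten(int tid, int rows, int channels) {
    OutputIndex o;
    o.channel = tid % channels;             // fastest-varying dimension
    o.row     = (tid / channels) % rows;
    o.view    = tid / (channels * rows);    // slowest-varying dimension
    return o;
}
```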
Kernel Source Code
The inner loop – executed at most 512 times!
Somewhat redundant, but good for prototyping:
• Determine if walking across rows or columns
• Compute ray change in x and y accordingly
• Compute line integral weighting

Kernel Source Code
The projection loops don't need to be inside the if condition:
1. Avoids unnecessary and costly warp divergence
2. Eliminates duplicate code
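The refactor described above can be sketched as follows (illustrative names, not the talk's kernel): compute only the per-direction parameters inside the branch, then run one shared projection loop, rather than duplicating the loop in both arms of the if.

```cpp
// Only the stride selection diverges; the loop itself is shared, so a warp
// whose threads disagree on walk direction does not execute two loop bodies.
float sumWithSharedLoop(const float* data, int n, bool walkRows,
                        int rowStride, int colStride) {
    int stride = walkRows ? rowStride : colStride;  // the only divergent bit
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)     // single loop: no duplicated code and
        acc += data[i * stride];    //   no divergent loop bodies in a warp
    return acc;
}
```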
Results of Naïve Implementation
One rotation of data, for this much of the anatomy, in a blazing 433 seconds!!!
Performance Profiling
NVidia Visual Profiler
Very basic profiling on a 7-minute application took overnight to complete. Try running a smaller but representative case.
Performance Profiling
Complete profiling took 10 minutes for 16 of 800 views. Overflow issues persist, but there is sufficient information to begin optimizing.
Guided Performance Analysis
Helpful tool to run the most relevant profiling experiments for your kernel. Took five minutes.
Register Usage
Registers/thread are mostly driven by the number of variables in the kernel (executive summary on kernel performance).

Register Usage
This function does ~30 loads and hits peak register usage. nvdisasm gives some insight into what is using up all the registers.
Register Usage
Perhaps this huge structure (~30 elements) is causing a lot of register spillage in the innermost loop?
Register Usage
Remove covertImageCoordinatesToSpace from the inner loop with algebraic factorization:
• 433 s/rotation, 67 registers → 27.8 s/rotation, 65 registers
Yet we still need 60+ registers.
Register Usage
Since we're optimizing the inner loop…
• 27.8 s/rotation, 65 registers → 16.2 s/rotation, 74 registers
Simplified calculations and introduced pitched memory (more on this later). What else changed that could have driven up register usage?
Register Usage
Went from passing the struct by value to passing a pointer to the struct. Passing big structs by value, not by reference, to a CUDA kernel is apparently a bad idea.
• 27.8 s/rotation, 65 registers
• 16.2 s/rotation, 74 registers
• 12.5 s/rotation, 48 registers
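The fix above is only a signature change; a sketch with hypothetical names (by-value kernel arguments are copied for the launch, and the large struct was observed here to inflate per-thread register usage):

```cpp
// The "~30 element" geometry struct from the earlier slide.
struct Geometry { float params[30]; };

// before: __global__ void kernel(Geometry g, ...)        // copied by value
// after:  __global__ void kernel(const Geometry* g, ...) // passed by pointer
float sampleParam(const Geometry* g, int i) {
    return g->params[i];    // dereference on use instead of copying all 30
}
```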
Occupancy
• 64 threads/block: 10.3 s/rotation
• 128 threads/block: 12.5 s/rotation
Changing block size (for this algorithm) is simple and can quickly yield improvements in device utilization. Using shared memory might make such tweaks more challenging.
Occupancy
What can we do to further improve on 63% occupancy? (Occupancy = active warps per SM / maximum warps per SM × 100%.) But is it worth it?
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Removing Expensive Instructions
IPC = Instructions Per Clock. Higher is faster. Expected values are from the CUDA Programming Guide; measured values are from my laptop's Quadro K4100M (CC 3.0).

Instruction            | Expected IPC | Measured IPC
float32 add/mply       | 6            | 5.26
float32 divide         | ?            | 3.41
float32 rsqrtf()       | 1            | 1.2
float32 1.0f/rsqrtf()  | ?            | 1.1
float32 sqrtf()        | ?            | 1.08
int32 add              | 5            | 4.01
int32 mply             | 1            | 1.09

Simple factorization removed 512 sqrt computations per thread, some less expensive multiplications, and some variables:
• 10.3 s/rotation, 48 registers per thread → 9.00 s/rotation, 39 registers per thread
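The kind of factorization described is easy to sketch (illustrative, not the talk's code): when a loop recomputes a square root whose inputs do not change across iterations, hoist it out, turning 512 sqrtf calls per thread into one.

```cpp
#include <cmath>

// Before: the sqrt is recomputed on every loop iteration.
float accumulateNaive(int steps, float dx, float dy) {
    float acc = 0.0f;
    for (int i = 0; i < steps; ++i)
        acc += std::sqrt(dx * dx + dy * dy) * i;   // sqrt every iteration
    return acc;
}

// After: the sqrt's inputs are loop-invariant, so compute it once.
float accumulateHoisted(int steps, float dx, float dy) {
    const float stepLen = std::sqrt(dx * dx + dy * dy);  // computed once
    float acc = 0.0f;
    for (int i = 0; i < steps; ++i)
        acc += stepLen * i;
    return acc;
}
```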
Assessing Our Progress
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 433
• Removed struct from loop: 27.78
• Removed clamping: 16.2
• Struct pointer: 12.5
• Block size optimization: 10.3
• sqrt removal: 8.995
About 50x faster, but still need another 10x.
First Pass Optimization
This might be a good point to profile an entire rotation.
• Where we started: 433 seconds / rotation
• Where we arrived: 9 seconds / rotation

High-level Profile for Full Rotation
• 16 projections: 0.5 s/rotation
• 800 projections: 9 s/rotation
The 16-projection experiment isn't representative of the full experiment. Why the 18x difference?
What’s different?
One rotation per N seconds, M equally spaced views; we want to compute the projection for each ray at each view location. In-plane, rays walk across IMAGE ROWS or IMAGE COLUMNS as the view rotates; out of plane, rays spread along the row direction in ±z.
Processing more views means moving further in Z and changing rotation angles.
A Little Design of Experiment
[Chart: Adjusting Total Number of Views – Execution Time [ms] and % load efficiency vs. number of views (0–500).]
[Chart: same data with normalized execution time.]
If table position and gantry angle are held constant, the number of views has the expected linear impact on performance.
A Little Design of Experiment
[Chart: Adjusting First View Location – Execution Time [ms] and % load efficiency vs. first view location (0–100 mm).]
[Chart: Adjusting Initial View Angle – Execution Time [ms] and % load efficiency vs. first view angle (0–8 radians).]
Adjusting table position or gantry angle with a fixed number of views causes performance loss. Why?
First View Location
[Chart: Adjusting First View Location – Execution Time [ms] and % load efficiency vs. first view location (0–100 mm), with first-view positions 1–6 marked along the anatomy.]
The original 16-view test case (at position 1) wasn't projecting much – many of its rays were completely outside of the image volume.
Positions 4–6 are more representative of actual performance.
Initial Rotation Angle
[Chart: Adjusting Initial View Angle – Execution Time [ms] and % load efficiency vs. first view angle (0–8 radians).]
Load efficiency and execution time vary drastically with rotation angle. nvvp suggests that we check whether our memory accesses are coalesced.
Memory Coalescing 101

 0  1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48

For each row (row 0, row 1, row 2, row 3, …), each thread will in parallel project a pixel.
• Threads will project each row in parallel.
• For row 2, the threads will collectively need to read memory elements 15, 16, 17, 18, and 19 at the same time.
• Since these elements are adjacent, the access is said to be coalesced.
• How coalesced depends on alignment, the total number of bytes read, etc.
• Best case, these elements can be read in one transaction.
Memory Not Coalescing 101

 0  1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48

For each column (…, column 3, column 4, column 5, column 6), each thread will in parallel project a pixel.
• Threads will project each column in parallel.
• For column 4, the threads will collectively need to read memory elements 11, 18, 25, 32, and 39 at the same time.
• Since these elements are NOT adjacent, the accesses are likely not coalesced.
• How uncoalesced depends on alignment, how far apart the elements are, etc.
• Worst case, these elements will be read in five transactions.
The projector rotates 360 degrees, so our accesses will have periodically bad efficiency!
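The two access patterns above reduce to a stride difference for a row-major image (function names are illustrative): a row walk touches consecutive addresses (stride 1, coalescible into one transaction), while a column walk touches addresses a full row apart (stride = width), splitting into many transactions.

```cpp
#include <vector>

// Addresses touched by n parallel threads walking along a row: adjacent.
std::vector<int> rowWalkIndices(int width, int row, int nThreads) {
    std::vector<int> idx(nThreads);
    for (int t = 0; t < nThreads; ++t)
        idx[t] = row * width + t;        // consecutive elements
    return idx;
}

// Addresses touched walking down a column: a full image row apart.
std::vector<int> columnWalkIndices(int width, int col, int nThreads) {
    std::vector<int> idx(nThreads);
    for (int t = 0; t < nThreads; ++t)
        idx[t] = t * width + col;        // elements `width` apart
    return idx;
}
```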
Some Thoughts on Design of Experiments
[Charts repeated: Adjusting First View Location and Adjusting Initial View Angle – Execution Time [ms] and % load efficiency.]
Make sure to test all key variables while optimizing, to save on embarrassment later on.
But what are we going to do about that coalescing problem?
Revisiting the sampling problem
"Walking" across just rows: just two samples?
"Walking" across rows OR columns: that's more like it.
Choose between sampling patterns with "if |cos(θ)| > |sin(θ)|", where θ is the ray angle.
Transposed Matrix
Instead of walking across columns, walk across the rows of a transposed image.
Improvement using transposed matrix
Was: 9 seconds/rotation Now: 2.97 seconds / rotation
32 registers – disabled debugging features, now 100% occupancy
Another way to deal with overflow problems is to break up the whole experiment into parts!
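The transposed-matrix idea above can be sketched as a one-time precomputation (layout and names are illustrative): build imgT once so that a walk down a column of img becomes a walk along a row of imgT, restoring stride-1 (coalescible) accesses in the kernel.

```cpp
#include <vector>

// Transpose a row-major width x height image.
std::vector<float> transpose(const std::vector<float>& img,
                             int width, int height) {
    std::vector<float> imgT(img.size());
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            imgT[c * height + r] = img[r * width + c];
    return imgT;
}
```

Column c of img is img[r*width + c] for r = 0..height-1; the same values sit at imgT[c*height + r], which are consecutive in memory.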
Tweaking Block Size
Was: 2.97 seconds/rotation. Now: 1.8 seconds/rotation.
32 registers – disabled debugging features.
Changed block dimensions from [16, 8, 1] to [16, 1, 8].
Taking Tally
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 433
• Removed struct from loop: 27.78
• Removed clamping: 16.2
• Struct pointer: 12.5
• Block size optimization: 10.3
• sqrt removal: 8.995
• Transposed matrix: 2.97
• Block size optimization: 1.8
Let's fix the first view location for the original benchmark.
Taking Correct Tally
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 733
• Removed struct from loop: 40.5
• Removed clamping: 22.5
• Struct pointer: 16.4
• sqrt removal + block size opt.: 15.7
• Transposed matrix: 3.34
• Block size optimization: 2.02
Off by 2x. What next?
Note on performance linearity: 16 views → 2.1 s/rotation; 800 views → 2.0 s/rotation.
Guided Profile Analysis
nvvp says latency is the bottleneck.

Guided Profile Analysis
Now nvvp is telling us that occupancy is the bottleneck.

Guided Profile Analysis
The profiler is giving us the runaround. Guess it doesn't know how to improve performance.

Unguided Profile Analysis
The innermost loop is essentially 3D interpolation. What can be done to accelerate these computations?
Texture Memory
• Hardware-accelerated 8-bit 2D/3D interpolation
• Morton-ordering-like schemes are used in texture hardware

Texture Memory
Texture hardware handles both the interpolation computations and boundary checking.
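A CPU sketch of what the texture unit provides for free: clamped boundary handling plus interpolation. Shown in 2D (bilinear) for brevity; the kernel's inner loop used the 3D equivalent via a texture fetch. The hardware uses low-precision fixed-point fractional weights, so this float version is only illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Boundary checking, like the texture unit's clamp addressing mode.
float fetchClamped(const std::vector<float>& img, int w, int h, int x, int y) {
    x = std::clamp(x, 0, w - 1);
    y = std::clamp(y, 0, h - 1);
    return img[y * w + x];
}

// Bilinear interpolation at a fractional coordinate, as the hardware does.
float bilinear(const std::vector<float>& img, int w, int h, float x, float y) {
    int ix = static_cast<int>(std::floor(x));
    int iy = static_cast<int>(std::floor(y));
    float fx = x - ix, fy = y - iy;            // fractional weights
    float v00 = fetchClamped(img, w, h, ix,     iy);
    float v10 = fetchClamped(img, w, h, ix + 1, iy);
    float v01 = fetchClamped(img, w, h, ix,     iy + 1);
    float v11 = fetchClamped(img, w, h, ix + 1, iy + 1);
    return (v00 * (1 - fx) + v10 * fx) * (1 - fy)
         + (v01 * (1 - fx) + v11 * fx) * fy;
}
```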
Improvement using textures
0.475 seconds/rotation; 0.429 seconds/rotation with another block size tweak; and 0.64 seconds/rotation including transfer times.
VICTORY – So Sweet
Forward Projector Performance (time per rotation [s], log scale):
• Naïve: 733
• Removed struct from loop: 40.5
• Removed clamping: 22.5
• Struct pointer: 16.4
• sqrt removal + block size opt.: 15.7
• Transposed matrix: 3.34
• Block size optimization: 2.02
• Image textures: 0.64
Verify Outputs
Difference between original and fully optimized sinogram output (reformatted views):
Same results as naïve within ±0.7%, but in 0.6 s instead of 733 seconds. Also ~2800x faster than the "reference" CPU implementation! (I think something is wrong with it.)
Further GPU Optimization Reading
• Asynchronous compute and transfer
• Shared memory
• Multiple GPUs