CUDA accelerated optimization for real-time...
-
2018-03-13 Page 1 Ismayil Guracar/ HC US PLM II
© Siemens Healthcare, 2017
CUDA accelerated optimization for real-time diagnostic ultrasound medical imaging motion tracking
Ismayil Guracar S8233 GTC 2018 Tuesday, March 27, 2018
-
Diagnostic Ultrasound Imaging Equipment
A machine for the acquisition of imaging information to affect diagnosis and treatment
-
Ultrasound B-mode (Brightness Mode) Imaging
-
Ultrasound Contrast Imaging Mode: Imaging Microbubbles
[Figure: contrast image alongside B-mode image]
-
Ultrasound Contrast Agents Overview
Gas-filled microbubbles with phospholipid shells: 1 µm in diameter, compared with 6-8 µm for red blood cells
Injected into the bloodstream (intravenous)
Agents are confined to the vascular tree, unlike MR and CT agents, which “leak” into tissue interstitial spaces
Excellent safety profile; commonly used in clinical practice
Visible with ultrasound with excellent sensitivity and specificity, using special processing that is sensitive to non-linearities in the bubble acoustic response
Destroyed in a local region by a relatively high-power burst of ultrasound energy (still within diagnostic levels)
-
Ultrasound Contrast Maximum Intensity Projection Capture
Creates a high-quality image from tens to hundreds of component images
Provides “vascular road mapping”, highlighting the path ultrasound contrast agents take through the vascular tree
Patient holds breath during the 10-15 second acquisition
But patients can’t always hold their breath long enough
[Figure: contrast image alongside B-mode image]
-
Motion Stabilized Maximum Intensity Projection
Tracks the B-mode image and stabilizes the contrast image while finding the maximum signal at each pixel location (MIP)
[Figure: MIP without motion compensation compared with MIP with motion compensation]
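The per-pixel maximum with motion compensation can be sketched on the CPU in plain C++. This is a minimal sketch, not the pipeline from the talk: it assumes an integer-pixel translation only (no rotation), row-major float images, and a sign convention in which MIP pixel (x, y) corresponds to frame pixel (x + dx, y + dy). The names `updateMip` and `mipDemo` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fold one contrast frame into a running maximum
// intensity projection (MIP) buffer. Assumed convention: MIP pixel (x, y)
// is compared with frame pixel (x + dx, y + dy). Pixels that map outside
// the frame leave the MIP buffer unchanged.
void updateMip(std::vector<float>& mip, const std::vector<float>& frame,
               int width, int height, int dx, int dy) {
    assert(mip.size() == frame.size());
    assert(mip.size() == static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int sx = x + dx;
            const int sy = y + dy;
            if (sx < 0 || sx >= width || sy < 0 || sy >= height) continue;
            float& m = mip[y * width + x];
            m = std::max(m, frame[sy * width + sx]);
        }
    }
}

// Tiny 3x1 demonstration: the frame content is shifted by +1 pixel in x.
// Returns the three MIP values packed as decimal digits for easy checking.
float mipDemo() {
    std::vector<float> mip = {1.0f, 0.0f, 0.0f};
    const std::vector<float> frame = {0.0f, 5.0f, 2.0f};
    updateMip(mip, frame, 3, 1, 1, 0);
    return mip[0] * 100.0f + mip[1] * 10.0f + mip[2];
}
```

In the demonstration, the first two MIP pixels pick up the shifted frame values 5 and 2, and the last pixel has no source pixel inside the frame, so it keeps its previous value.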
-
Motion Stabilized Maximum Intensity Projection (MIP) Signal Flow
[Signal flow diagram. Tracking path: Ultrasound B-Mode, Scan Conv, SAD Track against a Ref. buffer (1st frame of capture) at the tracking ROI locations, Motion estimate (Δx, Δy, θ). Display path: Ultrasound Contrast, Scan Conv, MAX with the MIP buffer, To Display.]
-
Rigid motion computation
Let $\mathbf{p}_i$ and $\mathbf{q}_i$, $i = 1, \dots, n$, be two sets of $n$ points in $\mathbb{R}^2$.
We want to compute the optimal translation $\mathbf{t}$ and rotation $R$ that minimize
$$\sum_{i=1}^{n} \omega_i \left\lVert R\,\mathbf{p}_i + \mathbf{t} - \mathbf{q}_i \right\rVert^2$$
where $\omega_i$ are weights for each point pair.
This finds the best-fitting rigid transformation that aligns two sets of corresponding points: in our case, the reference set of ROI centers $\mathbf{p}$, to be matched to the displaced input $\mathbf{q}$.
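For points in the plane, this weighted least-squares problem has a well-known closed-form solution: subtract the weighted centroids, take the rotation angle from the weighted cross and dot products of the centered points, then recover the translation from the centroids. The plain C++ sketch below illustrates that standard solution; the slides do not state which solver the product uses, and the names `fitRigid2D`, `RigidFit`, and `fitDemoError` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };
struct RigidFit { double theta, tx, ty; };  // rotation angle (radians) and translation

// Weighted 2D rigid fit: finds theta and t minimizing
// sum_i w[i] * || R(theta) * p[i] + t - q[i] ||^2.
RigidFit fitRigid2D(const std::vector<Point2>& p, const std::vector<Point2>& q,
                    const std::vector<double>& w) {
    assert(!p.empty() && p.size() == q.size() && p.size() == w.size());
    double wSum = 0, px = 0, py = 0, qx = 0, qy = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        wSum += w[i];
        px += w[i] * p[i].x;  py += w[i] * p[i].y;
        qx += w[i] * q[i].x;  qy += w[i] * q[i].y;
    }
    assert(wSum > 0);
    px /= wSum; py /= wSum; qx /= wSum; qy /= wSum;  // weighted centroids

    double dot = 0, cross = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        const double ax = p[i].x - px, ay = p[i].y - py;
        const double bx = q[i].x - qx, by = q[i].y - qy;
        dot   += w[i] * (ax * bx + ay * by);
        cross += w[i] * (ax * by - ay * bx);
    }
    const double theta = std::atan2(cross, dot);
    const double c = std::cos(theta), s = std::sin(theta);
    // t maps the rotated reference centroid onto the input centroid.
    return {theta, qx - (c * px - s * py), qy - (s * px + c * py)};
}

// Self-check: apply a known motion to four points and measure how far
// the recovered parameters are from the true ones.
double fitDemoError() {
    const double theta = 0.1, tx = 2.0, ty = -1.0;
    const std::vector<Point2> p = {{0, 0}, {1, 0}, {0, 1}, {2, 3}};
    const std::vector<double> w = {1.0, 2.0, 1.0, 0.5};
    std::vector<Point2> q;
    for (const Point2& a : p) {
        q.push_back({std::cos(theta) * a.x - std::sin(theta) * a.y + tx,
                     std::sin(theta) * a.x + std::cos(theta) * a.y + ty});
    }
    const RigidFit f = fitRigid2D(p, q, w);
    return std::max({std::fabs(f.theta - theta), std::fabs(f.tx - tx),
                     std::fabs(f.ty - ty)});
}
```

With noise-free correspondences the fit recovers the applied motion up to floating-point rounding. With noisy block-match displacements, the weights let less reliable ROIs contribute less.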
-
Sum of Absolute Differences Block Match
For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences (SAD) between the search and reference images
Reference template:
2 5
4 0
Search image:
3 5 6 7 9
1 2 5 5 7
1 4 0 7 5
4 7 5 9 4
SAD for each displacement (x, y) of the template within the search image:
(0, 0): |2 − 3| + |5 − 5| + |4 − 1| + |0 − 2| = 6
(+1, 0): |2 − 5| + |5 − 6| + |4 − 2| + |0 − 5| = 11
(+2, 0): |2 − 6| + |5 − 7| + |4 − 5| + |0 − 5| = 12
(+3, 0): |2 − 7| + |5 − 9| + |4 − 5| + |0 − 7| = 17
(0, +1): |2 − 1| + |5 − 2| + |4 − 1| + |0 − 4| = 11
(+1, +1): |2 − 2| + |5 − 5| + |4 − 4| + |0 − 0| = 0
(+2, +1): |2 − 5| + |5 − 5| + |4 − 0| + |0 − 7| = 14
(+3, +1): |2 − 5| + |5 − 7| + |4 − 7| + |0 − 5| = 13
(0, +2): |2 − 1| + |5 − 4| + |4 − 4| + |0 − 7| = 9
(+1, +2): |2 − 4| + |5 − 0| + |4 − 7| + |0 − 5| = 15
(+2, +2): |2 − 0| + |5 − 7| + |4 − 5| + |0 − 9| = 14
(+3, +2): |2 − 7| + |5 − 5| + |4 − 9| + |0 − 4| = 14
Sum of absolute differences:
6 11 12 17
11 0 14 13
9 15 14 14
Minimum SAD indicates displacement (+1, +1) is the best fit!
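The worked example can be reproduced with a few lines of plain C++. This is a CPU sketch of the exhaustive search rather than the CUDA kernel from the talk; the names `sadAt`, `bestMatch`, `slideSad`, and `slideExample` are hypothetical, and the data is the 2×2 template and 4×5 search image from the slide.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <vector>

struct Match { int dx, dy, sad; };

// SAD between a tw x th template and the search-image window whose
// top-left corner is at (dx, dy). Both images are row-major.
int sadAt(const std::vector<int>& tmpl, int tw, int th,
          const std::vector<int>& search, int sw, int dx, int dy) {
    int sum = 0;
    for (int y = 0; y < th; ++y)
        for (int x = 0; x < tw; ++x)
            sum += std::abs(tmpl[y * tw + x] - search[(y + dy) * sw + (x + dx)]);
    return sum;
}

// Exhaustive search over every displacement that keeps the template
// inside the search image; ties keep the first minimum found.
Match bestMatch(const std::vector<int>& tmpl, int tw, int th,
                const std::vector<int>& search, int sw, int sh) {
    assert(tw <= sw && th <= sh);
    Match best{0, 0, INT_MAX};
    for (int dy = 0; dy <= sh - th; ++dy)
        for (int dx = 0; dx <= sw - tw; ++dx) {
            const int s = sadAt(tmpl, tw, th, search, sw, dx, dy);
            if (s < best.sad) best = {dx, dy, s};
        }
    return best;
}

// Data from the slide.
const std::vector<int> kTemplate = {2, 5,
                                    4, 0};
const std::vector<int> kSearch = {3, 5, 6, 7, 9,
                                  1, 2, 5, 5, 7,
                                  1, 4, 0, 7, 5,
                                  4, 7, 5, 9, 4};

int slideSad(int dx, int dy) { return sadAt(kTemplate, 2, 2, kSearch, 5, dx, dy); }
Match slideExample() { return bestMatch(kTemplate, 2, 2, kSearch, 5, 4); }
```

`slideExample()` returns displacement (1, 1) with a SAD of 0, matching the table above.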
-
Block Match
1) Highly parallelizable, for example
• 1024 threads arranged in a 32×32 2D thread block to compute SAD
• Each thread computes SAD for a particular displacement over the ROI
e.g. thread 0 computes SAD by summing absolute differences between image1 and image2 over a region of interest (e.g. 64×64 pixels) with image2 displaced by (-16,-16) Cartesian pixels, and thread 1023 uses a displacement of (+15,+15)
2) The array of SAD values for each ROI is then searched by a subsequent kernel to find the minimum value. The 2D index of this value is the displacement estimate for the ROI
3) Displacement estimates are computed for multiple ROIs within the image space
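The mapping from thread index to displacement, and the second-stage minimum search, can be sketched in plain C++ as below. It assumes the 32×32 block and the offset of -16 from the example above, with one SAD value per thread stored row-major; `threadToDisplacement`, `argminDisplacement`, and `argminDemo` are hypothetical names, and the serial loop stands in for whatever reduction the real search kernel uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Displacement { int x, y; };

constexpr int kBlockDim = 32;  // 32x32 threads, one per displacement
constexpr int kOffset = -16;   // thread (0,0) -> (-16,-16), thread (31,31) -> (+15,+15)

// Displacement evaluated by thread (tx, ty) of the block.
Displacement threadToDisplacement(int tx, int ty) {
    return {tx + kOffset, ty + kOffset};
}

// Second stage: find the minimum of the per-thread SAD values and
// convert its 2D index back into a displacement.
Displacement argminDisplacement(const std::vector<int>& sad) {
    assert(sad.size() == static_cast<std::size_t>(kBlockDim) * kBlockDim);
    std::size_t best = 0;
    for (std::size_t i = 1; i < sad.size(); ++i)
        if (sad[i] < sad[best]) best = i;
    return threadToDisplacement(static_cast<int>(best % kBlockDim),
                                static_cast<int>(best / kBlockDim));
}

// Demonstration: a flat SAD surface with one minimum written by thread (13, 20).
Displacement argminDemo() {
    std::vector<int> sad(kBlockDim * kBlockDim, 100);
    sad[20 * kBlockDim + 13] = 7;
    return argminDisplacement(sad);
}
```

The minimum written by thread (13, 20) maps to a displacement of (-3, +4).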
-
Block match: SAD value array examples, each produced by a single 32×32 thread block
-
SAD kernel memory access pattern: accumulation for a single displacement
[Animation frames showing the reads from ROI 1 (reference) and ROI 2 (tracking) for threadIdx = (0,0), (1,0), (2,0), (0,1), and (5,5)]
-
SAD kernel: accumulate SAD values over ROIs
int xDisplacement = threadIdx.x + xDisplacementOffset;
int yDisplacement = threadIdx.y + yDisplacementOffset;
int xOut = threadIdx.x;
int yOut = threadIdx.y;
int roiX = d_roiParams[blockIdx.z].roiX;
int roiY = d_roiParams[blockIdx.z].roiY;
int sumDiff = 0;
for (int y = 0; y
-
SAD kernel memory access pattern for thread 0: ILP×2 accumulation for two consecutive x displacements
[Animation frames showing the reads from ROI 1 (reference) and ROI 2 (tracking)]
-
SAD ILP2: process 2 displacements as well as 2 pixels at a time in the inner loop: ACCESS ALIGNED TO SHORT INT
sumDiffEven = 0; sumDiffOdd = 0;
short2 S1; short4 S2;
for (int y = 0; y
-
ACCESS ALIGNED TO SHORT, with data reuse
sumDiffEven = 0;sumDiffOdd = 0;
for (int y = 0; y
-
SAD kernel memory access pattern for thread 0: ILP×2 accumulation for two consecutive x displacements, with short2-aligned read access on short2-aligned ROI 1 and ROI 2
[Animation frames showing the reads from ROI 1 (reference) and ROI 2 (tracking)]
-
ACCESS ALIGNED TO SHORT2, with data reuse
sumDiffEven = 0;sumDiffOdd = 0;
for (int y = 0; y
-
SAD kernel memory access pattern for thread 0: ILP×2 accumulation for two consecutive x displacements, with short2-aligned read access on short2-misaligned ROI 1 and ROI 2
[Animation frames showing the reads from ROI 1 (reference) and ROI 2 (tracking)]
-
ACCESS ALIGNED TO SHORT2
index1 = roiX + (y + roiY)*yPitch;
index2 = roiX + xDisplacement + (y + roiY + yDisplacement)*yPitch;
if (((index1 & 1) == 0) && ((index2 & 1) == 0)) // both aligned
{
short2 S2 = __ldg((short2*)&in2[index2]);
for (int x = 0; x
-
ENSURE ACCESS ALIGNED TO SHORT2 even if indices are not on short2 boundaries
else if (((index1 & 1) == 0) && ((index2 & 1) == 1))
{ // in1 aligned, use index2 offset down by 1 pixel and adjust sad values
index2 -= 1;
// … access data using index1 and index2
sumDiffEven.x = __sad(S1.x, S2.y, sumDiffEven.x);
sumDiffEven.y = __sad(S1.y, S2next.x, sumDiffEven.y);
sumDiffOdd.x = __sad(S1.x, S2next.x, sumDiffOdd.x);
sumDiffOdd.y = __sad(S1.y, S2next.y, sumDiffOdd.y);
}
-
ENSURE ACCESS ALIGNED TO SHORT2 even if indices are not on short2 boundaries
else if (((index1 & 1) == 1) && ((index2 & 1) == 0))
{ // in1 not aligned, use index1 offset down by 1 pixel for short2 access
index1--;
// … access data using index1 and index2
sumDiffEven.x = __sad(S1.y, S2.x, sumDiffEven.x);
sumDiffEven.y = __sad(S1next.x, S2.y, sumDiffEven.y);
sumDiffOdd.x = __sad(S1.y, S2.y, sumDiffOdd.x);
sumDiffOdd.y = __sad(S1next.x, S2next.x, sumDiffOdd.y);
}
else if (((index1 & 1) == 1) && ((index2 & 1) == 1))
{ // both not aligned, use index1 and index2 offset down by 1 pixel for short2 access
index1--; index2--;
// … access data using index1 and index2
sumDiffEven.x = __sad(S1.y, S2.y, sumDiffEven.x);
sumDiffEven.y = __sad(S1next.x, S2next.x, sumDiffEven.y);
sumDiffOdd.x = __sad(S1.y, S2next.x, sumDiffOdd.x);
sumDiffOdd.y = __sad(S1next.x, S2next.y, sumDiffOdd.y);
}
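The idea behind the four alignment cases can be checked on the CPU: two pixels of ROI 1 and three consecutive pixels of ROI 2 are picked out of pair loads that always start on an even index, and the resulting even and odd displacement sums must equal those from plain scalar access. The plain C++ sketch below emulates this. `Short2` stands in for CUDA's `short2`, and `loadShort2`, `sadScalar`, `sadAligned`, and `alignedMatchesScalar` are hypothetical names. It selects pixels with conditionals instead of the four explicit branches used in the kernel, so it illustrates the indexing only, not the performance.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct Short2 { short x, y; };       // stand-in for CUDA's short2
struct SadPair { int even, odd; };   // SAD at displacement d and d + 1

// Emulates an aligned short2 load: the index must be even.
Short2 loadShort2(const std::vector<short>& img, int index) {
    assert(index % 2 == 0 && index + 1 < static_cast<int>(img.size()));
    return {img[index], img[index + 1]};
}

// Reference: scalar access, two pixels, two consecutive x displacements.
SadPair sadScalar(const std::vector<short>& in1, const std::vector<short>& in2,
                  int index1, int index2) {
    SadPair s{0, 0};
    for (int p = 0; p < 2; ++p) {
        s.even += std::abs(in1[index1 + p] - in2[index2 + p]);
        s.odd  += std::abs(in1[index1 + p] - in2[index2 + p + 1]);
    }
    return s;
}

// Same sums using only aligned pair loads. A misaligned index is rounded
// down by one pixel and the wanted pixels are taken from the loaded pairs.
SadPair sadAligned(const std::vector<short>& in1, const std::vector<short>& in2,
                   int index1, int index2) {
    const int a1 = index1 & ~1, a2 = index2 & ~1;
    const Short2 S1 = loadShort2(in1, a1), S1next = loadShort2(in1, a1 + 2);
    const Short2 S2 = loadShort2(in2, a2), S2next = loadShort2(in2, a2 + 2);
    const bool odd1 = (index1 & 1) != 0, odd2 = (index2 & 1) != 0;
    const int p0 = odd1 ? S1.y : S1.x;            // in1[index1]
    const int p1 = odd1 ? S1next.x : S1.y;        // in1[index1 + 1]
    const int q0 = odd2 ? S2.y : S2.x;            // in2[index2]
    const int q1 = odd2 ? S2next.x : S2.y;        // in2[index2 + 1]
    const int q2 = odd2 ? S2next.y : S2next.x;    // in2[index2 + 2]
    return {std::abs(p0 - q0) + std::abs(p1 - q1),
            std::abs(p0 - q1) + std::abs(p1 - q2)};
}

// Checks all four alignment combinations against the scalar reference.
bool alignedMatchesScalar() {
    const std::vector<short> in1 = {3, 9, 4, 7, 1, 8, 2, 6, 5, 0, 11, 13};
    const std::vector<short> in2 = {6, 2, 8, 1, 9, 3, 7, 4, 0, 5, 12, 10};
    for (int index1 = 2; index1 <= 5; ++index1)
        for (int index2 = 2; index2 <= 5; ++index2) {
            const SadPair a = sadAligned(in1, in2, index1, index2);
            const SadPair s = sadScalar(in1, in2, index1, index2);
            if (a.even != s.even || a.odd != s.odd) return false;
        }
    return true;
}
```

On the GPU, resolving the alignment case once per row with explicit branches keeps the pixel selection out of the inner loop.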
-
The value of ILP and consolidated memory access: performance improvements in SAD calculation
SAD search execution time with P4000 GPU (192 ROIs, each 32×32 in size, with 32×32 search area)
SAD kernel | Threads per block/ROI | Execution time
No ILP, short access per thread | 32×32 | 2003 µsec
ILP×2, serial short access per thread | 16×32 | 959 µsec
ILP×2, combined and aligned short2 access per thread, with 4 alignment combinations | 16×32 | 548 µsec
ILP×4, combined and aligned short4 access per thread, with 16 alignment combinations | 8×32 | 490 µsec
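The relative gains follow directly from the execution times. A small C++ helper makes the arithmetic explicit; `speedup` is a hypothetical name and the inputs are the measured times quoted above.

```cpp
#include <cassert>

// Speedup of a kernel relative to a baseline, given execution times
// expressed in the same unit (microseconds here).
double speedup(double baselineTime, double kernelTime) {
    assert(kernelTime > 0.0);
    return baselineTime / kernelTime;
}
```

Relative to the 2003 µsec baseline, this gives roughly 2.09× for ILP×2 with serial short access, 3.66× for ILP×2 with short2 access, and 4.09× for ILP×4 with short4 access.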
-
Nsight measurements
-
Kernel | Threads/block | Bytes read/warp
No ILP | 32×32 | 32×2 = 64
ILP×2, serial short | 16×32 | 32×4 = 128
ILP×2, short2 | 16×32 | 32×4 = 128
ILP×4, short4 | 8×32 | 32×8 = 256
-
__ldg instruction
Use it to explicitly route global reads through the texture cache
Some improvement in the SAD kernels is seen compared to a plain global read
__ldg() improves ILP×2 SAD performance by 7%
Cache hit rate increases from 93% to 99%
-
An Alternative to CUDA kernels
Use the built-in motion estimation hardware in the GPU
Free built-in hardware used for H.264 encoding
Motion estimation vector results are exposed via the NVENC API
Motion-estimation-only mode is available on Maxwell hardware or later
Disadvantages
Inflexible ROI placement
Limited search area
Requires host synchronization, which makes it difficult to integrate into a real-time CUDA pipeline
-
Conclusions and Observations
The L1/texture cache is easy to put into use via the __ldg intrinsic, versus shared memory, which requires careful coordination among threads, attention to bank conflicts, and careful use of tiling to fully exploit the limited shared memory size
ILP is important for maximizing performance
Further improvement comes from reducing the memory transaction count by consolidating accesses while maintaining alignment, i.e. short → short2 → short4
Handling all possible alignment cases separately increases kernel code size and coding effort, but the payoff may be greatly improved performance.
-
Thank you for your attention and questions
Ismayil Guracar
Senior Key Expert
Siemens Medical Solutions, USA Inc.
Advanced Development - Ultrasound
Mountain View, California