CUDA accelerated optimization for real-time...

Transcript of CUDA accelerated optimization for real-time...

  • 2018-03-13 Page 1 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    CUDA accelerated optimization for real-time diagnostic ultrasound medical imaging motion tracking

    Ismayil Guracar S8233 GTC 2018 Tuesday, March 27, 2018

  • 2018-03-13 Page 2 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017 © 2018 Siemens. All Rights Reserved.

    Diagnostic Ultrasound Imaging Equipment

    A machine for the acquisition of imaging information to affect diagnosis and treatment

  • 2018-03-13 Page 3 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Ultrasound B-mode (Brightness Mode) Imaging

  • 2018-03-13 Page 4 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Ultrasound Contrast Imaging Mode: Imaging Microbubbles

    [Figure panels: Contrast Image, B-mode Image]

  • 2018-03-13 Page 5 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Ultrasound Contrast Agents Overview

    Gas-filled microbubbles with phospholipid shells: 1 µm diameter, compared with 6-8 µm for red blood cells

    Injected into the bloodstream (intravenous)

    Agents are confined to the vascular tree, unlike MR and CT agents, which “leak” into tissue interstitial spaces

    Excellent safety profile; commonly used in clinical practice

    Visible with ultrasound with excellent sensitivity and specificity, using special processing that is sensitive to non-linearities in the bubble acoustic response

    Destroyed in a local region by a relatively high-power burst of ultrasound energy (still within diagnostic levels)

  • 2018-03-13 Page 6 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Ultrasound Contrast Maximum Intensity Projection Capture

    Creates a high-quality image from tens to hundreds of component images

    Provides “vascular road mapping”, highlighting the path ultrasound contrast agents take through the vascular tree

    Patient holds breath during the 10-15 second acquisition

    But patients can’t always hold their breath long enough

    [Figure panels: Contrast Image, B-mode Image]

  • 2018-03-13 Page 7 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Motion Stabilized Maximum Intensity Projection

    Tracks the B-mode image and stabilizes contrast while finding the maximum signal at each pixel location (MIP)

    [Figure panels: Without Motion Compensation, With Motion Compensation]

  • 2018-03-13 Page 8 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Motion Stabilized Maximum Intensity Projection (MIP) Signal Flow

    [Signal-flow diagram]

    B-mode path: Ultrasound B-Mode → Scan Conv → Track buffer and Ref. buffer (1st frame of capture) → SAD (at the tracking ROI locations) → Motion estimate (Δx, Δy, θ)

    Contrast path: Ultrasound Contrast → Scan Conv → MAX with MIP buffer → To Display
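    The MAX stage of the contrast path can be sketched on the host as follows. This is a minimal illustration, not the presenter's code: it assumes the motion estimate has been rounded to an integer translation (dx, dy) and leaves out the rotation θ. The function name `mipUpdate` and the buffer layout are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Fold one contrast frame into the running MIP buffer.
// (dx, dy) is the estimated displacement of this frame relative to the
// reference frame (assumption: integer translation only, rotation ignored).
void mipUpdate(std::vector<short>& mip, const std::vector<short>& frame,
               int width, int height, int dx, int dy) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sx = x + dx;  // where this reference pixel sits in the new frame
            int sy = y + dy;
            if (sx < 0 || sx >= width || sy < 0 || sy >= height) continue;
            mip[y * width + x] = std::max(mip[y * width + x], frame[sy * width + sx]);
        }
    }
}
```

    Pixels whose motion-compensated source falls outside the frame are left unchanged, so the MIP buffer keeps its earlier maximum there.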

  • 2018-03-13 Page 9 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Rigid motion computation

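    Let $\mathbf{p}_i$ and $\mathbf{q}_i$, $i = 1, \dots, n$, be two sets of $n$ points in $\mathbb{R}^2$

    We want to compute the optimal translation $\mathbf{t}$ and rotation $R$ that minimize

    $$\sum_{i=1}^{n} \omega_i \,\lVert R\,\mathbf{p}_i + \mathbf{t} - \mathbf{q}_i \rVert^2$$

    where $\omega_i$ are weights for each point pair.

    Find the best-fitting rigid transformation that aligns two sets of corresponding points: in our case, the reference set of ROI centers $\mathbf{p}$, to be matched to the displaced input $\mathbf{q}$

    [Figure: point sets 𝒑 and 𝒒]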
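    For the 2D case the minimization has a closed form, sketched below. The slide does not say which solver is used, so this is an assumption: it is the standard weighted-centroid approach, with the rotation angle obtained from `atan2` of the weighted cross and dot sums. The names `Point2`, `RigidFit` and `fitRigid2D` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };
struct RigidFit { double theta, tx, ty; };

// Weighted least-squares rigid fit in 2D: finds theta and t minimizing
// sum_i w_i * || R(theta) p_i + t - q_i ||^2.
RigidFit fitRigid2D(const std::vector<Point2>& p, const std::vector<Point2>& q,
                    const std::vector<double>& w) {
    double wSum = 0, pcx = 0, pcy = 0, qcx = 0, qcy = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {  // weighted centroids
        wSum += w[i];
        pcx += w[i] * p[i].x;  pcy += w[i] * p[i].y;
        qcx += w[i] * q[i].x;  qcy += w[i] * q[i].y;
    }
    pcx /= wSum; pcy /= wSum; qcx /= wSum; qcy /= wSum;

    double sCos = 0, sSin = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {  // centered dot and cross sums
        double px = p[i].x - pcx, py = p[i].y - pcy;
        double qx = q[i].x - qcx, qy = q[i].y - qcy;
        sCos += w[i] * (px * qx + py * qy);
        sSin += w[i] * (px * qy - py * qx);
    }
    double theta = std::atan2(sSin, sCos);
    double c = std::cos(theta), s = std::sin(theta);
    // t maps the rotated centroid of p onto the centroid of q
    return { theta, qcx - (c * pcx - s * pcy), qcy - (s * pcx + c * pcy) };
}
```

    The weights let unreliable ROIs (for example, those with a poorly defined SAD minimum) contribute less to the fit; how the weights are chosen is not stated in the slides.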

  • 2018-03-13 Page 10 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

  • 2018-03-13 Page 11 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 3| + |5 − 5| + |4 − 1| + |0 − 2| = 6

    Sum of absolute differences:
    6

  • 2018-03-13 Page 12 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 5| + |5 − 6| + |4 − 2| + |0 − 5| = 11

    Sum of absolute differences:
    6 11

  • 2018-03-13 Page 13 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 6| + |5 − 7| + |4 − 5| + |0 − 5| = 12

    Sum of absolute differences:
    6 11 12

  • 2018-03-13 Page 14 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 7| + |5 − 9| + |4 − 5| + |0 − 7| = 17

    Sum of absolute differences:
    6 11 12 17

  • 2018-03-13 Page 15 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 1| + |5 − 2| + |4 − 1| + |0 − 4| = 11

    Sum of absolute differences:
    6 11 12 17
    11

  • 2018-03-13 Page 16 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 2| + |5 − 5| + |4 − 4| + |0 − 0| = 0

    Sum of absolute differences:
    6 11 12 17
    11 0

  • 2018-03-13 Page 17 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 5| + |5 − 5| + |4 − 0| + |0 − 7| = 14

    Sum of absolute differences:
    6 11 12 17
    11 0 14

  • 2018-03-13 Page 18 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 5| + |5 − 7| + |4 − 7| + |0 − 5| = 13

    Sum of absolute differences:
    6 11 12 17
    11 0 14 13

  • 2018-03-13 Page 19 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 1| + |5 − 4| + |4 − 4| + |0 − 7| = 9

    Sum of absolute differences:
    6 11 12 17
    11 0 14 13
    9

  • 2018-03-13 Page 20 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 4| + |5 − 0| + |4 − 7| + |0 − 5| = 15

    Sum of absolute differences:
    6 11 12 17
    11 0 14 13
    9 15

  • 2018-03-13 Page 21 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 0| + |5 − 7| + |4 − 5| + |0 − 9| = 14

    Sum of absolute differences:
    6 11 12 17
    11 0 14 13
    9 15 14

  • 2018-03-13 Page 22 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 7| + |5 − 5| + |4 − 9| + |0 − 4| = 14

    Sum of absolute differences:
    6 11 12 17
    11 0 14 13
    9 15 14 14

  • 2018-03-13 Page 23 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Sum of Absolute Differences Block Match

    For each tracking ROI, find the x and y displacement that minimizes the sum of absolute differences between the search and reference images

    Reference template:
    2 5
    4 0

    Search image:
    3 5 6 7 9
    1 2 5 5 7
    1 4 0 7 5
    4 7 5 9 4

    |2 − 2| + |5 − 5| + |4 − 4| + |0 − 0| = 0

    Sum of absolute differences:
    6 11 12 17
    11 0 14 13
    9 15 14 14

    Minimum SAD indicates displacement (+1, +1) is best fit!
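    The walk-through above can be reproduced with a small host-side reference implementation. This is a serial sketch for checking the arithmetic, not the GPU kernel; the names `Match` and `blockMatch` are hypothetical, and displacements are counted from the top-left position of the search image, as on the slides.

```cpp
#include <cstdlib>
#include <vector>

struct Match { int dx, dy, sad; };

// Exhaustive block match: slide the template over every position of the
// search image and keep the displacement with the smallest SAD.
Match blockMatch(const std::vector<int>& tmpl, int tw, int th,
                 const std::vector<int>& search, int sw, int sh) {
    Match best{0, 0, -1};
    for (int dy = 0; dy + th <= sh; ++dy) {
        for (int dx = 0; dx + tw <= sw; ++dx) {
            int sad = 0;
            for (int y = 0; y < th; ++y)
                for (int x = 0; x < tw; ++x)
                    sad += std::abs(tmpl[y * tw + x] - search[(y + dy) * sw + (x + dx)]);
            if (best.sad < 0 || sad < best.sad) best = {dx, dy, sad};
        }
    }
    return best;
}
```

    With the 2x2 template and the 5x4 search image from the slides, the search visits the same twelve positions as the SAD table and returns displacement (+1, +1) with a SAD of 0.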

  • 2018-03-13 Page 24 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Block Match

    1) Highly parallelizable, for example:

    • 1024 threads arranged in a 32×32 2D thread block to compute SAD

    • Each thread computes the SAD for one particular displacement over the ROI

    e.g. thread 0 computes its SAD by summing absolute differences between image1 and image2 over a region of interest (e.g. 64x64 pixels) with image2 displaced by (-16,-16) Cartesian pixels, and thread 1023 uses a displacement of (+15,+15)

    2) The array of SAD values for each ROI is then searched by a subsequent kernel to find the minimum value. The 2D index of this value is the displacement estimate for the ROI

    3) Displacement estimation is performed for multiple ROIs within the image space
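    The thread-to-displacement mapping in step 1 and the minimum search in step 2 can be sketched on the host as below. The helper names are hypothetical and the second stage is shown as a plain serial scan; a GPU version would more likely use a parallel reduction, which the slides do not detail.

```cpp
#include <vector>

struct Disp { int dx, dy; };

// Map a linear thread id within a blockDim x blockDim block to the
// displacement that thread evaluates (offset = -16 for a 32x32 block).
Disp threadToDisplacement(int tid, int blockDim, int offset) {
    return { tid % blockDim + offset, tid / blockDim + offset };
}

// Second stage: find the smallest SAD and convert its index to a displacement.
Disp argminDisplacement(const std::vector<int>& sad, int blockDim, int offset) {
    int best = 0;
    for (int i = 1; i < static_cast<int>(sad.size()); ++i)
        if (sad[i] < sad[best]) best = i;
    return threadToDisplacement(best, blockDim, offset);
}
```

    With `blockDim = 32` and `offset = -16`, thread 0 maps to (-16,-16) and thread 1023 maps to (+15,+15), matching the example in the text.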

  • 2018-03-13 Page 25 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Block match—SAD value array examples each produced by a single 32x32 thread block

  • 2018-03-13 Page 26 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for threadIdx = (0,0): accumulation for a single displacement

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-13 Page 35 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for threadIdx = (1,0): accumulation for a single displacement

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-13 Page 36 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for threadIdx = (2,0): accumulation for a single displacement

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-13 Page 37 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for threadIdx = (0,1): accumulation for a single displacement

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-13 Page 38 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for threadIdx = (5,5): accumulation for a single displacement

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-20 Page 39 Ismayil Guracar/ HC US PLM PM

    © Siemens Healthcare GmbH, 2018

    SAD kernel– accumulate SAD values over ROIs

    int xDisplacement = threadIdx.x + xDisplacementOffset;

    int yDisplacement = threadIdx.y + yDisplacementOffset;

    int xOut = threadIdx.x;

    int yOut = threadIdx.y;

    int roiX = d_roiParams[blockIdx.z].roiX;

    int roiY = d_roiParams[blockIdx.z].roiY;

    int sumDiff = 0;

    for (int y = 0; y

  • 2018-03-13 Page 40 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for thread0: ILPx2 accumulation for two consecutive x displacements

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-20 Page 43 Ismayil Guracar/ HC US PLM PM

    © Siemens Healthcare GmbH, 2018

    SAD ILP2: process 2 displacements as well as 2 pixels at a time in the inner loop: ACCESS ALIGNED TO SHORT INT

    sumDiffEven = 0; sumDiffOdd = 0;

    short2 S1; short4 S2;

    for (int y = 0; y

  • 2018-03-20 Page 44 Ismayil Guracar/ HC US PLM PM

    © Siemens Healthcare GmbH, 2018

    ACCESS ALIGNED TO SHORT, with data reuse

    sumDiffEven = 0;sumDiffOdd = 0;

    for (int y = 0; y
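    The loop body is truncated in this transcript. The idea of ILP×2 with data reuse can be illustrated on one image row with the host-side sketch below. It is an assumption about the general technique, not the presenter's kernel: each reference pixel is loaded once and used for two consecutive x displacements, and the tracking pixel loaded for the odd displacement is carried over as the next even one. The names `SadPair` and `sadRowIlp2` are hypothetical.

```cpp
#include <cstdlib>

struct SadPair { int even, odd; };

// One ROI row: accumulate the SAD for x displacement d ("even") and d + 1 ("odd").
// ref points at the first reference pixel of the row; trk points at the tracking
// pixel for displacement d and must have width + 1 readable elements.
SadPair sadRowIlp2(const short* ref, const short* trk, int width) {
    SadPair sum{0, 0};
    short t0 = trk[0];              // tracking pixel for displacement d
    for (int x = 0; x < width; ++x) {
        short r  = ref[x];          // loaded once, used for both displacements
        short t1 = trk[x + 1];      // tracking pixel for displacement d + 1
        sum.even += std::abs(r - t0);
        sum.odd  += std::abs(r - t1);
        t0 = t1;                    // reuse as the next iteration's "even" pixel
    }
    return sum;
}
```

    The two accumulations are independent of each other, which gives the hardware two instruction streams to overlap; the carried-over value removes one load per pixel.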

  • 2018-03-13 Page 45 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for thread0: ILPx2 accumulation for two consecutive x displacements, short2 aligned read access on short2 aligned ROI 1 and ROI 2

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-20 Page 48 Ismayil Guracar/ HC US PLM PM

    © Siemens Healthcare GmbH, 2018

    ACCESS ALIGNED TO SHORT2, with data reuse

    sumDiffEven = 0;sumDiffOdd = 0;

    for (int y = 0; y

  • 2018-03-13 Page 49 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    SAD kernel memory access pattern for thread0: ILPx2 accumulation for two consecutive x displacements, short2 aligned read access on short2 misaligned ROI 1 and ROI 2

    [Figure: ROI 1 (reference) and ROI 2 (tracking)]

  • 2018-03-20 Page 52 Ismayil Guracar/ HC US PLM PM

    © Siemens Healthcare GmbH, 2018

    ACCESS ALIGNED TO SHORT2

    index1 = roiX + (y + roiY)*yPitch;

    index2 = roiX + xDisplacement + (y + roiY + yDisplacement)*yPitch;

    if (((index1 & 1) == 0) && ((index2 & 1) == 0)) // both aligned

    {

    short2 S2 = __ldg((short2*)&in2[index2]);

    for (int x = 0; x

  • 2018-03-20 Page 53 Ismayil Guracar/ HC US PLM PM

    © Siemens Healthcare GmbH, 2018

    ENSURE ACCESS ALIGNED TO SHORT2 even if indices are not on short2 boundaries

    else if (((index1 & 1) == 0) && ((index2 & 1) == 1))

    {// in1 aligned, use index2 offset down by 1 pixel and adjust sad values

    index2 -= 1;

    … access data using index1 and index2

    sumDiffEven.x = __sad(S1.x, S2.y, sumDiffEven.x);

    sumDiffEven.y = __sad(S1.y, S2next.x, sumDiffEven.y);

    sumDiffOdd.x = __sad(S1.x, S2next.x, sumDiffOdd.x);

    sumDiffOdd.y = __sad(S1.y, S2next.y, sumDiffOdd.y);

    }

  • 2018-03-20 Page 54 Ismayil Guracar/ HC US PLM PM

    © Siemens Healthcare GmbH, 2018

    ENSURE ACCESS ALIGNED TO SHORT2 even if indices are not on short2 boundaries

    if (((index1 & 1) == 1) && ((index2 & 1) == 0))

    {// in1 not aligned, use index1 offset down by 1 pixel for short2 access

    index1--;

    … access data using index1 and index2

    sumDiffEven.x = __sad(S1.y, S2.x, sumDiffEven.x);

    sumDiffEven.y = __sad(S1next.x, S2.y, sumDiffEven.y);

    sumDiffOdd.x = __sad(S1next.x, S2.y, sumDiffOdd.x);

    sumDiffOdd.y = __sad(S1next.y, S2next.x, sumDiffOdd.y);

    }

    else if (((index1 & 1) == 1) && ((index2 & 1) == 1))

    {// both not aligned, use index1 and index2 offset down by 1 pixel for short2 access

    index1--; index2--;

    … access data using index1 and index2

    sumDiffEven.x = __sad(S1.y, S2.y, sumDiffEven.x);

    sumDiffEven.y = __sad(S1next.x, S2next.x, sumDiffEven.y);

    sumDiffOdd.x = __sad(S1next.x, S2next.x, sumDiffOdd.x);

    sumDiffOdd.y = __sad(S1next.y, S2next.y, sumDiffOdd.y);

    }
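    The principle behind the four alignment cases is that every read is a two-element load at an even index, and the parity of the wanted index decides whether `.x` or `.y` of the loaded pair is used. The host-side sketch below shows that principle in a simplified form. It is not the presenter's kernel: it selects per element rather than branching once per row into four specialized code paths, and the names `Short2`, `loadShort2`, `pick` and `sadRowAligned` are hypothetical.

```cpp
#include <cstdlib>

struct Short2 { short x, y; };
struct SadPair { int even, odd; };

// Emulates an aligned short2 load: callers only pass even element indices.
Short2 loadShort2(const short* data, int evenIndex) {
    return { data[evenIndex], data[evenIndex + 1] };
}

// Fetch element k (k >= 0) using only aligned pair loads:
// .x of the pair when k is even, .y when k is odd.
short pick(const short* data, int k) {
    Short2 pair = loadShort2(data, k & ~1);
    return (k & 1) ? pair.y : pair.x;
}

// SAD over `width` pixels for the displacement starting at index2 ("even")
// and the next displacement starting at index2 + 1 ("odd").
SadPair sadRowAligned(const short* in1, const short* in2,
                      int index1, int index2, int width) {
    SadPair sum{0, 0};
    for (int x = 0; x < width; ++x) {
        short s1 = pick(in1, index1 + x);
        sum.even += std::abs(s1 - pick(in2, index2 + x));
        sum.odd  += std::abs(s1 - pick(in2, index2 + x + 1));
    }
    return sum;
}
```

    Resolving the parity once per row, as the four cases on the slides do, removes the per-element selection and lets each loaded pair be reused across both accumulators.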

  • 2018-03-13 Page 55 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    The value of ILP and consolidated memory access: performance improvements in SAD calculation

    SAD search execution time with P4000 GPU, 192 ROIs, each 32x32 in size, with 32x32 search area

    SAD Kernel | Threads per block/ROI | Execution time
    No ILP, short access per thread | 32x32 | 2003 µsec
    ILP×2, serial short access per thread | 16x32 | 959 µsec
    ILP×2, combined and aligned short2 access per thread, with 4 alignment combinations | 16x32 | 548 µsec
    ILP×4, combined and aligned short4 access per thread, with 16 alignment combinations | 8x32 | 490 µsec

  • 2018-03-13 Page 56 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Nsight measurements

  • 2018-03-13 Page 57 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Kernel | Threads/block | Bytes read per warp
    No ILP | 32x32 | 32x2 = 64
    ILPx2, serial short | 16x32 | 32x4 = 128
    ILPx2, short2 | 16x32 | 32x4 = 128
    ILPx4, short4 | 8x32 | 32x8 = 256

  • 2018-03-13 Page 58 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    __ldg instruction

    Use to explicitly route global reads through the texture cache

    Some improvement in SAD kernels seen compared to a plain global read

    __ldg() improves ILP×2 SAD performance by 7%; the cache hit rate increases from 93% to 99%

  • 2018-03-13 Page 59 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    An Alternative to CUDA kernels

    Use the built-in motion estimation hardware in the GPU

    Free built-in hardware used for H264 encoding

    Motion estimation vector results exposed via the NVENC API

    Motion-estimation-only mode available on Maxwell HW or later

    Disadvantages:

    Inflexible ROI placement

    Limited search area

    Requires host synchronization, which makes it difficult to integrate into a real-time CUDA pipeline

  • 2018-03-13 Page 60 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Conclusions and Observations

    The L1/texture cache is easy to put into use via the __ldg intrinsic, whereas shared memory requires careful coordination among threads, attention to bank conflicts, and careful use of tiling to fully exploit the limited shared memory size

    ILP is important to maximizing performance

    Further improvement comes from reducing the memory transaction count by consolidating access while maintaining alignment, i.e. short → short2 → short4

    Handling all possible alignment cases separately will increase kernel code size and coding effort, but the payoff may be greatly improved performance.

  • 2018-03-13 Page 61 Ismayil Guracar/ HC US PLM II

    © Siemens Healthcare, 2017

    Thank you for your attention and questions

    Ismayil Guracar

    Senior Key Expert

    Siemens Medical Solutions, USA Inc.

    Advanced Development - Ultrasound

    Mountain View, California
