HC20.24.250.CUDA Application Development Experience...Compute Q Acquire Data Compute FHd Find ρ...

Page 2:

[HKR HotChips-2007]

Page 3:

6-12 weeks

Page 4:

Figure panels: (a) Cartesian scan data, FFT; (b) spiral scan data, gridding; (c) spiral scan data, iterative reconstruction.

Spiral scan data + iterative recon: fast scan reduces artifacts, iterative reconstruction increases SNR. Reconstruction requires a lot of computation.

Page 5:

Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago

Page 6:

Compute Q; Acquire Data; Compute F^H d; Find ρ

More than 99.5% of time

Haldar, et al, “Anatomically-constrained reconstruction from noisy data,” MR in Medicine.

Reconstruction of a 64^3 image used to take days!

Page 7:

Performance: 128 GFLOPS

Time: 1.2 minutes

Page 8:

for (m = 0; m < M; m++) {
  phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
  for (n = 0; n < N; n++) {
    exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}

Page 9:

for (m = 0; m < M; m++) {
  phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
  for (n = 0; n < N; n++) {
    exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}

Page 10:

for (m = 0; m < M; m++) {
  phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
  for (n = 0; n < N; n++) {
    exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}

for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
    exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}

Page 11:

for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
    exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}

for (m = 0; m < M; m++) {
  phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
}
for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}

Page 12:

for (m = 0; m < M; m++) {
  phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
}
for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}

Page 13:

for (m = 0; m < M/32; m++) {
  exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
  rQ[n] += phi[m]*cos(exp);
  iQ[n] += phi[m]*sin(exp);
}

Page 14:

for (m = 31*M/32; m < 32*M/32; m++) {
  exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
  rQ[n] += phi[m]*cos(exp);
  iQ[n] += phi[m]*sin(exp);
}
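The first and last loops above suggest that the M samples are processed in 32 consecutive pieces. A small C sketch of the bounds arithmetic follows. The helper names chunk_bounds and count_covered are hypothetical, and the chunk count is a parameter here rather than the fixed 32 on the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open sample range [start, end) of chunk c when M samples are
   split into numChunks consecutive pieces. */
void chunk_bounds(size_t M, size_t numChunks, size_t c,
                  size_t *start, size_t *end)
{
    assert(numChunks > 0 && c < numChunks);
    *start = c * M / numChunks;
    *end = (c + 1) * M / numChunks;
}

/* Walk all chunks in order and count the samples they cover.
   The chunks are contiguous, so the result equals M. */
size_t count_covered(size_t M, size_t numChunks)
{
    size_t covered = 0, expectedStart = 0;
    for (size_t c = 0; c < numChunks; c++) {
        size_t start, end;
        chunk_bounds(M, numChunks, c, &start, &end);
        assert(start == expectedStart);   /* no gap and no overlap */
        covered += end - start;
        expectedStart = end;
    }
    return covered;
}
```

Computing each bound as c*M/numChunks, rather than stepping by a fixed M/numChunks, keeps the chunks contiguous even when M is not a multiple of the chunk count.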

Page 15:

__global__ void Q(float* x, float* y, float* z, float* rQ, float* iQ,
                  float* kx, float* ky, float* kz, float* phi,
                  int startM, int endM)
{
  int n = blockIdx.x*TPB + threadIdx.x;
  for (int m = startM; m < endM; m++) {
    float exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m] * cos(exp);
    iQ[n] += phi[m] * sin(exp);
  }
}

Page 22:

A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995).

Page 25:

Increase in per-thread performance, but fewer threads: lower overall performance
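One way such a trade-off can arise is through per-thread register use. The slide text does not say which resource limits the thread count, so the following C sketch is only an illustration under assumed limits: 8192 registers, 768 resident threads and 8 resident blocks per multiprocessor, figures commonly quoted for G80-class GPUs. The function name resident_threads and the register counts used in the example are hypothetical.

```c
#include <assert.h>

/* Number of threads that can be resident on one multiprocessor, given
   per-thread register use and the (assumed) per-multiprocessor limits. */
int resident_threads(int regsPerThread, int threadsPerBlock,
                     int regsPerSM, int maxThreadsPerSM, int maxBlocksPerSM)
{
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    int blocksByRegs = regsPerSM / (regsPerThread * threadsPerBlock);
    int blocksByThreads = maxThreadsPerSM / threadsPerBlock;
    int blocks = blocksByRegs;
    if (blocksByThreads < blocks) blocks = blocksByThreads;
    if (maxBlocksPerSM < blocks) blocks = maxBlocksPerSM;
    return blocks * threadsPerBlock;   /* whole blocks only */
}
```

With 256 threads per block, going from 10 to 11 registers per thread drops the result from 768 to 512 resident threads under these assumed limits, because a whole block no longer fits. An optimization that makes each thread faster but costs one more register can therefore leave fewer threads available to hide latency.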

Page 31:

8X

Page 32:

108X 228X 357X

Page 35:

• Programmers are doing too much heavy lifting

• Too many memory organizational details are exposed to the programmers

Sum of Absolute Differences

S. Ryoo, et al, “Program Optimization Space Pruning for a Multithreaded GPU,” ACM/IEEE CGO, April 2008.

Page 36:

Sum of Absolute Differences

By selecting only Pareto-optimal points, we pruned the search space by 98% and still found the optimal configuration

S. Ryoo, et al, “Program Optimization Space Pruning for a Multithreaded GPU,” ACM/IEEE CGO, April 2008.
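Selecting Pareto-optimal points can be sketched in a few lines of C. This is a generic Pareto filter, not the procedure from the cited paper: the two metric names (efficiency, utilization), the Config type and the function names are assumptions made for illustration, and higher is taken to be better for both metrics.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate configuration scored by two metrics (hypothetical names). */
typedef struct {
    double efficiency;
    double utilization;
} Config;

/* b dominates a if b is at least as good on both metrics
   and strictly better on at least one. */
static int dominates(const Config *b, const Config *a)
{
    return b->efficiency >= a->efficiency &&
           b->utilization >= a->utilization &&
           (b->efficiency > a->efficiency || b->utilization > a->utilization);
}

/* Set keep[i] to 1 for every configuration that no other configuration
   dominates, 0 otherwise. Returns the number of configurations kept. */
size_t pareto_front(const Config *cfg, size_t n, int *keep)
{
    assert(cfg != NULL && keep != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        keep[i] = 1;
        for (size_t j = 0; j < n; j++) {
            if (j != i && dominates(&cfg[j], &cfg[i])) {
                keep[i] = 0;
                break;
            }
        }
        kept += (size_t)keep[i];
    }
    return kept;
}
```

Only the kept configurations would then need to be compiled and timed, which is where the reduction in search effort comes from.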

Page 37:

IA multi-core & Larrabee; NVIDIA GPU

NVIDIA SDK 1.1; MCUDA/OpenMP; CUDA-lite; CUDA-tune; CUDA-auto

• 1st generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers

• Parameterized CUDA programming using auto-tuning and optimization space pruning

• Locality annotation programming to eliminate need for explicit management of memory types and data transfers

• Implicitly parallel programming with data structure and function property annotations to enable auto parallelization

Page 39:

S.S. Stone, et al, “Accelerating Advanced MRI Reconstruction using GPUs,” ACM Computing Frontiers Conference 2008, Italy, May 2008.

10 kernels, less than 1.5 min after acceleration

Page 43:

Matrixmul(A[ ], B[ ], C[ ])
{
  __shared__ Asub[ ][ ], Bsub[ ][ ];
  int a,b,c;
  float Csub;
  int k;
  for(…)
  {
    Asub[tx][ty] = A[a];
    Bsub[tx][ty] = B[b];
    __syncthreads();
    for( k = 0; k < blockDim.x; k++ )
      Csub += Asub[ty][k] * Bsub[k][tx];
    __syncthreads();
  }
}

Page 44:

Matrixmul(A[ ], B[ ], C[ ])
{
  __shared__ Asub[ ][ ], Bsub[ ][ ];
  int a,b,c;
  float Csub;
  int k;
  for(…)
  {
    for(ty=0; ty < blockDim.y; ty++)
      for(tx=0; tx < blockDim.x; tx++)
      {
        Asub[tx][ty] = A[a];
        Bsub[tx][ty] = B[b];
      }
    for(ty=0; ty < blockDim.y; ty++)
      for(tx=0; tx < blockDim.x; tx++)
      {
        for( k = 0; k < blockDim.x; k++ )
          Csub += Asub[ty][k] * Bsub[k][tx];
      }
  }
}
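The thread-loop form above can be filled in as ordinary C. The sketch below rests on several assumptions that are not on the slide: square row-major n x n matrices, a tile size TILE equal to both block dimensions, n a multiple of TILE, and the function names matmul_block and matmul. The per-thread accumulator Csub becomes a small array here, since each emulated thread needs its own copy across the two thread loops.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 2   /* blockDim.x == blockDim.y == TILE in this sketch */

/* Compute one TILE x TILE tile of C = A*B; (by, bx) are the tile coordinates. */
static void matmul_block(const float *A, const float *B, float *C,
                         size_t n, size_t by, size_t bx)
{
    float Asub[TILE][TILE], Bsub[TILE][TILE];
    float Csub[TILE][TILE] = {{0.0f}};   /* one accumulator per emulated thread */

    for (size_t t = 0; t < n / TILE; t++) {
        /* Thread loop 1: the code before the first __syncthreads(). */
        for (size_t ty = 0; ty < TILE; ty++)
            for (size_t tx = 0; tx < TILE; tx++) {
                Asub[ty][tx] = A[(by*TILE + ty)*n + t*TILE + tx];
                Bsub[ty][tx] = B[(t*TILE + ty)*n + bx*TILE + tx];
            }
        /* Thread loop 2: the code between the two barriers. */
        for (size_t ty = 0; ty < TILE; ty++)
            for (size_t tx = 0; tx < TILE; tx++)
                for (size_t k = 0; k < TILE; k++)
                    Csub[ty][tx] += Asub[ty][k] * Bsub[k][tx];
    }
    /* Each emulated thread writes its own element of C. */
    for (size_t ty = 0; ty < TILE; ty++)
        for (size_t tx = 0; tx < TILE; tx++)
            C[(by*TILE + ty)*n + bx*TILE + tx] = Csub[ty][tx];
}

/* Loop over all tiles; each call stands in for one thread block. */
void matmul(const float *A, const float *B, float *C, size_t n)
{
    assert(n % TILE == 0);
    for (size_t by = 0; by < n / TILE; by++)
        for (size_t bx = 0; bx < n / TILE; bx++)
            matmul_block(A, B, C, n, by, bx);
}
```

Splitting the body at each barrier preserves the barrier's guarantee: the first thread loop finishes loading both tiles for every emulated thread before the second thread loop reads any of them.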

Page 45:

•  Consistent speed-up over hand-tuned single-thread code

•  Best optimizations for GPU and CPU not always the same

*Over hand-optimized CPU

**Intel MKL, multi-core execution

Page 47:

Thank you! Any questions?