CUDA Dynamic Parallelism - NVIDIA · 2014. 4. 18. · The first focus will be on how dynamic parallelism is implemented and what users need to know to understand what is really going

CUDA Dynamic Parallelism

A Debugger Developer's Take on the Kernel of a Revolution

Copyright © 2011 Rogue Wave Software | All Rights Reserved

What are we talking about?

• Problem: recursion & similar

o Solution: CUDA dynamic parallelism

• Problem: debugging CUDA is hard

o Dynamic or not: it’s still hard

o Solution: TotalView

Your speaker

• Rogue Wave Software

o Since 1989

o Tools.h++ - the proto-C++ Standard Library

o Acquired TotalView in 2009

• Larry Edelstein

o Around even longer

o Salesforce.com, Lotus, CNET, Klout

o Technical sales and solutions architecture

Some workloads are so hard (for CUDA)!

• Parallel tasks that create more parallel tasks

• Parallel recursive tasks

Quicksort

• Partition the array

• Recurse

Quicksort

Source: http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/

How do you parallelize quicksort?

• Save tasks on stack

o Code complexity - shared CPU-GPU work stack

• Run a stage at a time

• Sync after each stage

o Costly: short sorts must wait for long sorts
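To make the cost of the stage-at-a-time approach concrete, here is a minimal sketch of a host-driven quicksort. It is an illustration written for this transcript, not the code from the talk: `Range`, `partitionStage`, and `stagedQuicksort` are hypothetical names, and the single-threaded partition per range is chosen for brevity rather than speed. The point to notice is that the CPU owns the work list and must synchronize after every stage.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Range { int left, right; };   // hypothetical work item: one subarray to partition

// One thread per range: partition it around a middle pivot and report both halves.
__global__ void partitionStage(int *data, const Range *in, Range *out, int count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count) return;
    int left = in[t].left, right = in[t].right;
    int pivot = data[left + (right - left) / 2];
    int i = left, j = right;
    while (i <= j) {                         // Hoare-style partition
        while (data[i] < pivot) ++i;
        while (data[j] > pivot) --j;
        if (i <= j) {
            int tmp = data[i]; data[i] = data[j]; data[j] = tmp;
            ++i; --j;
        }
    }
    out[2 * t].left = left;   out[2 * t].right = j;
    out[2 * t + 1].left = i;  out[2 * t + 1].right = right;
}

// Host driver: the work list lives on the CPU and every stage ends with a sync.
bool stagedQuicksort(int *hostData, int n)
{
    if (n < 2) return true;
    int *data = nullptr;
    Range *in = nullptr, *out = nullptr;
    // Ranges with 2+ elements are disjoint, so n entries per buffer is enough.
    cudaError_t err = cudaMalloc(&data, n * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc(&in, n * sizeof(Range));
    if (err == cudaSuccess) err = cudaMalloc(&out, n * sizeof(Range));
    if (err == cudaSuccess)
        err = cudaMemcpy(data, hostData, n * sizeof(int), cudaMemcpyHostToDevice);

    std::vector<Range> work(1);
    work[0].left = 0; work[0].right = n - 1;
    while (err == cudaSuccess && !work.empty()) {
        int count = (int)work.size();
        cudaMemcpy(in, work.data(), count * sizeof(Range), cudaMemcpyHostToDevice);
        partitionStage<<<(count + 127) / 128, 128>>>(data, in, out, count);
        err = cudaDeviceSynchronize();       // short ranges wait for the longest one
        if (err != cudaSuccess) break;
        std::vector<Range> produced(2 * count);
        cudaMemcpy(produced.data(), out, 2 * count * sizeof(Range), cudaMemcpyDeviceToHost);
        work.clear();
        for (const Range &r : produced)
            if (r.left < r.right) work.push_back(r);   // keep ranges with 2+ elements
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hostData, data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data); cudaFree(in); cudaFree(out);       // freeing a null pointer is a no-op
    return err == cudaSuccess;
}
```

Every pass through the `while` loop costs two copies and a full device synchronization, which is the overhead the bullets above describe.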

Need a better way

• Dynamic workloads

• Move logic into kernel

• Recurse within kernel

Dynamic parallelism

• Introduced in CUDA 5.0

• Familiar syntax:

__global__ void myKernel(..) {
    doWork();
    myOtherKernel<<<x, y>>>(..);   // launched from device code: x blocks of y threads
    doMoreWork();
}
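As a slightly fuller illustration of the syntax above, the sketch below shows a parent kernel that sizes and launches a child grid at run time. The names `childKernel`, `parentKernel`, and `doubleOnDevice` are hypothetical and not from the talk. It assumes a device of compute capability 3.5 or higher and a build with relocatable device code linked against the device runtime, for example `nvcc -rdc=true file.cu -lcudadevrt` plus a suitable `-arch` flag.

```cuda
#include <cuda_runtime.h>

// Child grid: each thread doubles one element.
__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Parent grid: one thread decides at run time how large the child grid must be.
__global__ void parentKernel(int *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);   // device-side launch
    }
}

// Host helper (hypothetical): runs the parent and copies the result back.
bool doubleOnDevice(int *hostData, int n)
{
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, hostData, bytes, cudaMemcpyHostToDevice);
    parentKernel<<<1, 32>>>(dev, n);
    // A parent grid is not complete until its children are, so this waits for both.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(hostData, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

The host launches only `parentKernel`; the decision about the child's grid size never leaves the GPU.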

Code comparison: not dynamic (plus all the code required to share a stack between CPU and GPU) vs. dynamic
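The dynamic side of that comparison can be sketched as follows. This is a simplified illustration in the spirit of NVIDIA's simple quicksort sample, not the code shown on the slide: `quicksortKernel`, `quicksortOnDevice`, `MAX_DEPTH`, and `SMALL_SORT` are names and values assumed here, and each grid runs a single thread for clarity rather than performance. It needs the same build settings as any dynamic parallelism program (`-rdc=true`, `-lcudadevrt`, compute capability 3.5 or higher).

```cuda
#include <cuda_runtime.h>

#define MAX_DEPTH  16   // assumed cap on launch nesting
#define SMALL_SORT 32   // assumed cutoff below which we stop recursing

// Fallback for small or deeply nested ranges: insertion sort by one thread.
__device__ void insertionSort(int *data, int left, int right)
{
    for (int i = left + 1; i <= right; ++i) {
        int key = data[i];
        int j = i - 1;
        while (j >= left && data[j] > key) { data[j + 1] = data[j]; --j; }
        data[j + 1] = key;
    }
}

// One thread partitions [left, right], then launches a child grid per side.
__global__ void quicksortKernel(int *data, int left, int right, int depth)
{
    if (depth >= MAX_DEPTH || right - left <= SMALL_SORT) {
        insertionSort(data, left, right);
        return;
    }
    int pivot = data[left + (right - left) / 2];
    int i = left, j = right;
    while (i <= j) {                         // Hoare-style partition
        while (data[i] < pivot) ++i;
        while (data[j] > pivot) --j;
        if (i <= j) {
            int tmp = data[i]; data[i] = data[j]; data[j] = tmp;
            ++i; --j;
        }
    }
    if (left < j) {                          // recurse on the left part
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        quicksortKernel<<<1, 1, 0, s>>>(data, left, j, depth + 1);
        cudaStreamDestroy(s);                // released once the child finishes
    }
    if (i < right) {                         // recurse on the right part
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        quicksortKernel<<<1, 1, 0, s>>>(data, i, right, depth + 1);
        cudaStreamDestroy(s);
    }
}

// Host helper (hypothetical): one launch, one sync, no CPU-side work stack.
bool quicksortOnDevice(int *hostData, int n)
{
    if (n < 2) return true;
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, hostData, bytes, cudaMemcpyHostToDevice);
    quicksortKernel<<<1, 1>>>(dev, 0, n - 1, 0);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(hostData, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

The two child launches use separate non-blocking streams so the left and right halves can proceed independently, and short ranges no longer wait for long ones. Very large inputs may need the device runtime's pending-launch limit raised, which this sketch does not do.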

Performance

That’s great!

but

Debugging CUDA is a challenge

• Two separate realms of processing

• Highly parallel

• Dynamic

o It’s a complex graph of grids

• Call stack? Not exactly.

• Steer using logical and device coordinates

If we had a debugger that could...

• Show me the active kernels on the device

• Let me set a breakpoint in any kernel

• Help me navigate from kernel to kernel

• Tell me the relationships between kernels

TotalView with CUDA support

• CUDA and host code display the same

• Set breakpoints, see variables

• Control execution as much as possible

o control by warp

• Navigate device threads

o logical coordinates

o device coordinates
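For readers unfamiliar with the two coordinate systems, the sketch below records both for every thread of a small launch. It is an illustration only and says nothing about how TotalView obtains them: `ThreadCoords`, `recordCoords`, and `collectCoords` are hypothetical names, and the assumption that device coordinates mean SM, warp, and lane follows common CUDA debugger usage. The values are read from the PTX special registers `%smid`, `%warpid`, and `%laneid`.

```cuda
#include <cuda_runtime.h>

struct ThreadCoords {
    unsigned block, thread;     // logical coordinates (1-D launch)
    unsigned sm, warp, lane;    // device coordinates at the moment of sampling
};

// Thin wrappers around PTX special registers.
__device__ unsigned readSmId()   { unsigned v; asm volatile("mov.u32 %0, %%smid;"   : "=r"(v)); return v; }
__device__ unsigned readWarpId() { unsigned v; asm volatile("mov.u32 %0, %%warpid;" : "=r"(v)); return v; }
__device__ unsigned readLaneId() { unsigned v; asm volatile("mov.u32 %0, %%laneid;" : "=r"(v)); return v; }

// Each thread records where it is logically and where it is physically running.
__global__ void recordCoords(ThreadCoords *out)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i].block  = blockIdx.x;
    out[i].thread = threadIdx.x;
    out[i].sm     = readSmId();
    out[i].warp   = readWarpId();   // may differ between runs; a snapshot, not an identity
    out[i].lane   = readLaneId();
}

// Host helper (hypothetical): host must point to blocks * threads entries.
bool collectCoords(ThreadCoords *host, int blocks, int threads)
{
    ThreadCoords *dev = nullptr;
    size_t bytes = (size_t)blocks * threads * sizeof(ThreadCoords);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    recordCoords<<<blocks, threads>>>(dev);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

Logical coordinates are fixed by the launch configuration, while the SM and warp a thread lands on are scheduling decisions, which is why a debugger offers both views.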

TotalView with dynamic support

• TotalView can debug CUDA dynamic parallelism programs using the CUDA 5.5 toolchain and runtime

• Dynamically launched CUDA kernels report which kernels launched them (parent kernels)

TotalView details

• Linux, Unix, and Mac OS X

• C/C++ and Fortran

Device status display

Questions

Acknowledgements

• http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/

• https://www.hackerrank.com/challenges/quicksort2