CUDA Dynamic Parallelism - NVIDIA · 2014. 4. 18. · The first focus will be on how dynamic parallelism is implemented and what users need to know to understand what is really going...
CUDA Dynamic Parallelism
A Debugger Developer's Take on the
Kernel of a Revolution
Copyright © 2011 Rogue Wave Software | All Rights Reserved
What are we talking about?
• Problem: recursion & similar
o Solution: CUDA dynamic parallelism
• Problem: debugging CUDA is hard
o Dynamic or not: it’s still hard
o Solution: TotalView
Your speaker
• Rogue Wave Software
o Since 1989
o Tools.h++ - the proto-C++ Standard Library
o Acquired TotalView in 2009
• Larry Edelstein
o Around even longer
o Salesforce.com, Lotus, CNET, Klout
o Technical sales and solutions architecture
Some workloads are so hard (for CUDA)!
• Parallel tasks that create more parallel tasks
• Parallel recursive tasks
Quicksort
• Partition the array
• Recurse
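The two steps above can be sketched in host-side C++ (no GPU involved). This is a minimal illustration, not code from the talk: the Lomuto partition scheme and the names `lomutoPartition` and `quicksort` are assumptions chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Lomuto partition of the inclusive range a[lo..hi]. Uses the last element
// as the pivot, moves smaller elements to its left, and returns the pivot's
// final index.
std::size_t lomutoPartition(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi < a.size());
    const int pivot = a[hi];
    std::size_t store = lo;
    for (std::size_t i = lo; i < hi; ++i) {
        if (a[i] < pivot) {
            std::swap(a[i], a[store]);
            ++store;
        }
    }
    std::swap(a[store], a[hi]);
    return store;
}

// Step 1: partition the array. Step 2: recurse on both sides of the pivot.
void quicksort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (lo >= hi) return;                     // zero or one element: done
    const std::size_t p = lomutoPartition(a, lo, hi);
    if (p > lo) quicksort(a, lo, p - 1);      // guard against size_t underflow
    quicksort(a, p + 1, hi);
}

// Convenience wrapper that also handles the empty array.
void quicksort(std::vector<int>& a) {
    if (!a.empty()) quicksort(a, 0, a.size() - 1);
}
```

Each call produces two independent sub-problems, which is what makes the algorithm attractive to parallelize and awkward to schedule from the host.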
Quicksort
Source: http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/
How do you parallelize quicksort?
• Save tasks on stack
o Code complexity - shared CPU-GPU work stack
• Run a stage at a time
• Synch after each stage
o Costly: short sorts must wait for long sorts
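A rough host-only C++ simulation of this staged scheme is sketched below. It is not the speaker's code: `Range`, `partitionTask`, and `stagedQuicksort` are hypothetical names, and the inner loop merely stands in for what would be a batch of kernel launches followed by a device-wide synchronization on a real GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Range { std::size_t lo, hi; };  // inclusive bounds of one pending sort task

// Lomuto partition of a[r.lo..r.hi]; returns the pivot's final index.
std::size_t partitionTask(std::vector<int>& a, Range r) {
    assert(r.lo <= r.hi && r.hi < a.size());
    const int pivot = a[r.hi];
    std::size_t store = r.lo;
    for (std::size_t i = r.lo; i < r.hi; ++i) {
        if (a[i] < pivot) {
            std::swap(a[i], a[store]);
            ++store;
        }
    }
    std::swap(a[store], a[r.hi]);
    return store;
}

// Sorts `a` one stage at a time and returns the number of stages that ran.
std::size_t stagedQuicksort(std::vector<int>& a) {
    std::vector<Range> pending;               // the saved work "stack"
    if (a.size() > 1) pending.push_back({0, a.size() - 1});
    std::size_t stages = 0;
    while (!pending.empty()) {
        std::vector<Range> next;
        for (const Range& r : pending) {      // all tasks of the current stage
            const std::size_t p = partitionTask(a, r);
            if (p > r.lo + 1) next.push_back({r.lo, p - 1});  // left side has >= 2 elements
            if (p + 1 < r.hi) next.push_back({p + 1, r.hi});  // right side has >= 2 elements
        }
        // Barrier: no task in `next` starts until every task above has finished.
        pending = std::move(next);
        ++stages;
    }
    return stages;
}
```

The barrier is where the cost comes from: a range that finishes early cannot hand off its children until the slowest range of the same stage is done. With this pivot choice, an already-sorted input needs one stage per element beyond the first.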
Need a better way
• Dynamic workloads
• Move logic into kernel
• Recurse within kernel
Dynamic parallelism
• Introduced in CUDA 5.0
• Familiar syntax:
__global__ void myKernel(..) {
    doWork();
    // Launch a child grid from device code: x blocks of y threads each
    myOtherKernel<<<x, y>>>(..);
    doMoreWork();
}
Not dynamic vs. Dynamic
(Not dynamic: plus all the code required to share a stack between CPU and GPU)
Performance
That’s great!
but
Debugging CUDA is a challenge
• Two separate realms of processing
• Highly parallel
• Dynamic
o It’s a complex graph of grids
• Call stack? Not exactly.
• Steer using logical and device coordinates
If we had a debugger that could...
• Show me the active kernels on the device
• Let me set a breakpoint in any kernel
• Help me navigate from kernel to kernel
• Tell me the relationships between kernels
TotalView with CUDA support
• CUDA and host code display the same
• Set breakpoints, see variables
• Control execution as much as possible
o control by warp
• Navigate device threads
o logical coordinates
o device coordinates
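To make the two coordinate systems concrete, here is a small host-side C++ sketch of the part of the mapping that can be computed by hand: from a logical thread index to a warp within the block and a lane within that warp. It assumes a warp size of 32 and the usual CUDA linearization rule. `Dim3`, `linearThreadId`, and `toWarpLane` are hypothetical names, not TotalView or CUDA APIs. The SM and hardware warp slot a block lands on are chosen by the device at run time, so they cannot be derived this way.

```cpp
#include <cassert>

struct Dim3 { unsigned x, y, z; };       // stand-in for CUDA's dim3 / uint3

constexpr unsigned kWarpSize = 32;       // assumption: 32 threads per warp

// Linear thread index within a block: x + y*Dx + z*Dx*Dy.
unsigned linearThreadId(Dim3 threadIdx, Dim3 blockDim) {
    assert(threadIdx.x < blockDim.x &&
           threadIdx.y < blockDim.y &&
           threadIdx.z < blockDim.z);
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

struct WarpLane {
    unsigned warpInBlock;                // which warp of the block holds the thread
    unsigned lane;                       // position of the thread inside that warp
};

// Consecutive linear thread indices are grouped into warps of kWarpSize.
WarpLane toWarpLane(Dim3 threadIdx, Dim3 blockDim) {
    const unsigned tid = linearThreadId(threadIdx, blockDim);
    return {tid / kWarpSize, tid % kWarpSize};
}
```

For example, thread (5, 2, 0) in a 16 x 16 x 1 block has linear index 37, so it sits in warp 1 of its block at lane 5.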
TotalView with dynamic support
• TotalView can debug CUDA dynamic parallelism programs using the CUDA 5.5 toolchain and runtime
• Dynamically launched CUDA kernels report which kernels launched them (their parent kernels)
TotalView details
• Linux, Unix, and Mac OS X
• C/C++ and Fortran
Device status display
Questions
Acknowledgements
• http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/
• https://www.hackerrank.com/challenges/quicksort2