CS 179: Lecture 4
Lab Review 2
Groups of Threads (Hierarchy) (largest to smallest)

“Grid”: all of the threads
Size: (number of threads per block) * (number of blocks)

“Block”:
Size: user-specified
Should be at least a multiple of 32 (often, higher is better)
Upper limit given by hardware (512 on Tesla, 1024 on Fermi)
Features: shared memory, synchronization
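The hierarchy above is exactly what a kernel launch configures. A minimal CUDA sketch (the kernel name and sizes here are illustrative, not from the lab):

```cuda
#include <cuda_runtime.h>

// Each thread computes its global index from the grid/block hierarchy.
__global__ void indexKernel(int *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = idx;
}

int main() {
    const int threadsPerBlock = 512;  // a multiple of 32; at the Tesla-era limit
    const int numBlocks = 4;
    const int n = threadsPerBlock * numBlocks;  // grid size = product of the two

    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    // <<<blocks, threads per block>>> is the grid/block launch configuration
    indexKernel<<<numBlocks, threadsPerBlock>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```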
Groups of Threads

“Warp”:
Group of 32 threads
Execute in lockstep (same instructions)
Susceptible to divergence!
Divergence

“Two roads diverged in a wood…
…and I took both”
Divergence

What happens:
Executes normally until the if-statement
Branches to calculate Branch A (blue threads)
Goes back (!) and branches to calculate Branch B (red threads)
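As a sketch of how this arises (illustrative kernel, not the lab code): both sides of the branch occur inside every warp, so the warp runs Branch A with the odd threads masked off, then goes back and runs Branch B with the even threads masked off.

```cuda
// Every warp contains both even and odd thread indices, so every warp
// must serialize: Branch A first, then Branch B.
__global__ void divergentKernel(float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        out[idx] = idx * 2.0f;   // Branch A (blue threads)
    } else {
        out[idx] = idx * 0.5f;   // Branch B (red threads)
    }
}
```

Branching on something warp-aligned, e.g. (idx / 32) % 2, would put each warp entirely on one side and avoid the serialization.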
“Divergent tree”

(Diagram: tree reduction over a 512-thread block; each level halves the active threads. Level 1: even indices … 506, 508, 510; level 2: multiples of 4 … 500, 504, 508; level 3: multiples of 8 … 488, 496, 504; level 4: multiples of 16 … 464, 480, 496)
“Divergent tree”

// Let our shared memory block be partial_outputs[]
// (synchronize threads before starting)
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads

Get thread 0 to atomicAdd() partial_outputs[0] to output

Assumes block size is a power of 2…
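A CUDA rendering of the pseudocode above (a sketch: the kernel name and the loading of partial_outputs from an input array are my assumptions):

```cuda
#define BLOCK_SIZE 512  // must be a power of 2

// Divergent tree: activity is decided by (tid % stride), so active and
// idle threads stay interleaved within each warp at small offsets.
__global__ void reduceDivergent(const float *in, float *output) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    int tid = threadIdx.x;
    partial_outputs[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();  // synchronize before starting

    for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        if (tid % (offset * 2) == 0)
            partial_outputs[tid] += partial_outputs[tid + offset];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(output, partial_outputs[0]);  // one add per block
}
```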
“Non-divergent tree”
Example purposes only! Real blocks are way bigger!
“Non-divergent tree”

// Let our shared memory block be partial_outputs[]
set offset to the highest power of 2 that’s less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads

Get thread 0 to atomicAdd() partial_outputs[0] to output

Assumes block size is a power of 2…
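The same sketch for the non-divergent version (same assumptions about the kernel name and input loading): active threads form the contiguous range [0, offset), so whole warps are either fully active or fully idle.

```cuda
#define BLOCK_SIZE 512  // must be a power of 2

// Non-divergent tree: threads [0, offset) accumulate; since offset is a
// multiple of 32 at the top levels, warps are active or idle as a unit.
__global__ void reduceNonDivergent(const float *in, float *output) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    int tid = threadIdx.x;
    partial_outputs[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // blockDim.x / 2 is the highest power of 2 below the block dimension
    for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
        if (tid < offset)
            partial_outputs[tid] += partial_outputs[tid + offset];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(output, partial_outputs[0]);
}
```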
“Divergent tree”

Where is the divergence? Two branches:
Accumulate
Do nothing

If the second branch does nothing, then where is the performance loss?
“Divergent tree” – Analysis

First iteration (reduce 512 -> 256):

Warp of threads 0-31 (after calculating the polynomial):
Thread 0: accumulate
Thread 1: do nothing
Thread 2: accumulate
Thread 3: do nothing
…

Warp of threads 32-63: (same thing!)
… (up to) warp of threads 480-511

Number of executing warps: 512 / 32 = 16
“Divergent tree” – Analysis

Second iteration (reduce 256 -> 128):

Warp of threads 0-31 (after calculating the polynomial):
Thread 0: accumulate
Threads 1-3: do nothing
Thread 4: accumulate
Threads 5-7: do nothing
…

Warp of threads 32-63: (same thing!)
… (up to) warp of threads 480-511

Number of executing warps: 16 (again!)
“Divergent tree” – Analysis

(The process continues until the offset is large enough to separate whole warps)
“Non-divergent tree” – Analysis

First iteration (reduce 512 -> 256), part 1:

Warp of threads 0-31: accumulate
Warp of threads 32-63: accumulate
… (up to) warp of threads 224-255

Then what?
“Non-divergent tree” – Analysis

First iteration (reduce 512 -> 256), part 2:

Warp of threads 256-287: do nothing!
… (up to) warp of threads 480-511

Number of executing warps: 256 / 32 = 8 (was 16 previously!)
“Non-divergent tree” – Analysis

Second iteration (reduce 256 -> 128):

Warps of threads 0-31, …, 96-127: accumulate
Warps of threads 128-159, …, 480-511: do nothing!

Number of executing warps: 128 / 32 = 4 (was 16 previously!)
What happened? “Implicit divergence”
Why did we do this?
Performance improvements
Reveals GPU internals!
Final Puzzle

What happens when the polynomial order increases?
All these threads that we think are competing… are they?
The Real World
In medicine…

More sensitive devices -> more data!
More intensive algorithms
Real-time imaging and analysis
Most are parallelizable problems!

http://www.varian.com
MRI

“k-space” – inverse FFT
Real-time and high-resolution imaging

http://oregonstate.edu
CT, PET

Low-dose techniques (safety!)
4D CT imaging
X-ray CT vs. PET CT
Texture memory!

http://www.upmccancercenter.com/
Radiation Therapy

Goal: give sufficient dose to cancerous cells, minimize dose to healthy cells
More accurate algorithms possible!
Accuracy = safety!
40 minutes -> 10 seconds

http://en.wikipedia.org
Notes

Office hours:
Kevin: Monday 8-10 PM
Ben: Tuesday 7-9 PM
Connor: Tuesday 8-10 PM

Lab 2: due Wednesday (4/16), 5 PM