[Page 1]
Graphics Processing Unit
Zhenyu Ye, Henk Corporaal
5SIA0, TU/e, 2015
Background image: Nvidia Titan X GPU with Maxwell architecture
[Page 2]
The GPU-CPU Gap (by NVIDIA)
ref: Tesla GPU Computing Brochure
[Page 3]
The GPU-CPU Gap (data)
Historical data on 1403 Intel CPUs (left) and 566 NVIDIA GPUs (right).
Reference: Figures 2.3 and 2.4 in the PhD thesis of Cedric Nugteren, TU/e, 2014, “Improving the Programmability of GPU Architectures”, http://repository.tue.nl/771987
[Page 4]
The GPU-CPU Gap (by Intel)
ref: "Debunking the 100X GPU vs. CPU myth", http://dx.doi.org/10.1145/1815961.1816021
[Page 5]
In This Lecture, We Will Find Out
● The architecture of GPUs
  o Single Instruction Multiple Thread
  o Memory hierarchy
● The programming model of GPUs
  o The threading hierarchy
  o Memory space
[Page 6]
GPU in Graphics Card
Image: Nvidia GTX 980
[Page 7]
GPU in SoC
Source: http://www.chipworks.com/en/technical-competitive-analysis/resources/blog/inside-the-iphone-6-and-iphone-6-plus/
Example: A8 SoC in iPhone 6
GPU (speculated): PowerVR GX6450
[Page 8]
NVIDIA Fermi (2009)
ref: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[Page 9]
NVIDIA Kepler (2012)
ref: http://www.nvidia.com/object/nvidia-kepler.html
[Page 10]
NVIDIA Maxwell (2014)
ref: http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF
[Page 11]
Transistor Count
Categories: multicore, manycore, SoC, GPU, FPGA
ref: http://en.wikipedia.org/wiki/Transistor_count
CPU / SoC:
Processor                  Transistors      Tech. Node
61-Core Xeon Phi           5,000,000,000    22 nm
Xbox One Main SoC          5,000,000,000    28 nm
32-Core Sparc M7           10,000,000,000+  20 nm

GPU:
Processor                  Transistors      Tech. Node
Nvidia GM200 Maxwell       8,100,000,000    28 nm

FPGA:
FPGA                       Transistors      Tech. Node
Virtex-7                   6,800,000,000    28 nm
Virtex-UltraScale XCVU440  20,000,000,000+  20 nm
[Page 12]
What Can 8bn Transistors Do?
Render triangles. Billions of triangles per second!
ref: "How GPUs Work", http://dx.doi.org/10.1109/MC.2007.59
[Page 13]
And Scientific Computing
TOP500 supercomputer list in Nov. 2015. (http://www.top500.org)
[Page 14]
Let's Start with Examples
We will start from C and RISC.
[Page 15]
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Let's Start with C and RISC
Assembly code of the inner loop:
lw r0, 4(r1)
addi r0, r0, 1
sw r0, 4(r1)
Programmer's view of RISC
[Page 16]
Most CPUs Have Vector SIMD Units
Programmer's view of a vector SIMD, e.g. SSE.
[Page 17]
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ] // load
  addps xmm0, xmm1          // add 1
  movups [ &A[i][0] ], xmm0 // store
}
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Looks like the previous example,but each SSE instruction executes on 4 ALUs.
Let's Program the Vector SIMD
Unroll the inner loop into a vector operation.
[Page 18]
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ] // load
  addps xmm0, xmm1          // add 1
  movups [ &A[i][0] ], xmm0 // store
}
How Do Vector Programs Run?
[Page 19]
CUDA Programmer's View of GPUs
A GPU contains multiple SIMD units.
[Page 20]
CUDA Programmer's View of GPUs
A GPU contains multiple SIMD units. All of them can access global memory.
[Page 21]
What Are the Differences? (SSE vs. GPU)
Let's start with two important differences:
1. GPUs use threads instead of vectors.
2. GPUs have a "shared memory" space.
[Page 22]
Thread Hierarchy in CUDA
A Grid contains Thread Blocks.
A Thread Block contains Threads.
[Page 23]
Let's Start Again from C
int A[2][4];
for(i=0;i<2;i++)
  for(j=0;j<4;j++)
    A[i][j]++;

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
// define 2x4=8 threads
// all threads run same kernel
// each thread block has its id
// each thread has its id
// each thread has different i and j
convert into CUDA
[Page 24]
Thread Hierarchy
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
// define 2x4=8 threads
// all threads run same kernel
// each thread block has its id
// each thread has its id
// each thread has different i and j
Example: thread 3 of block 1 operates on element A[1][3].
[Page 25]
How Are Threads Scheduled?
[Page 26]
Blocks Are Dynamically Scheduled
[Page 27]
How Are Threads Executed?
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
mov.u32 %r0, %ctaid.x
mov.u32 %r1, %ntid.x
mov.u32 %r2, %tid.x
mad.u32 %r3, %r0, %r1, %r2
ld.global.s32 %r4, [%r3]
add.s32 %r4, %r4, 1
st.global.s32 [%r3], %r4
// r0 = i = blockIdx.x
// r1 = "threads-per-block"
// r2 = j = threadIdx.x
// r3 = i * "threads-per-block" + j
// r4 = A[i][j]
// r4 = r4 + 1
// A[i][j] = r4
[Page 28]
Utilizing Memory Hierarchy
Memory access latency: several cycles (on-chip memory) vs. 100+ cycles (global memory).
[Page 29]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
i = threadIdx.y;
j = threadIdx.x;
tmp = (A[i-1][j-1] + A[i-1][j] +
… + A[i+1][j+1] ) / 9;
A[i][j] = tmp;
}
Example: Average Filters
Average over a 3x3 window for a 16x16 array.
Each thread loads 9 elements from global memory. It takes hundreds of cycles.
[Page 30]
Utilizing the Shared Memory
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Average over a 3x3 window for a 16x16 array.
[Page 31]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Utilizing the Shared Memory
allocate shared mem
Each thread loads one element from global memory.
[Page 32]
However, the Program Is Incorrect
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Hazards!
[Page 33]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Let's See What's Wrong
Assume 256 threads are scheduled on 8 PEs.
Before load instruction
[Page 34]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Let's See What's Wrong
Assume 256 threads are scheduled on 8 PEs.
Some threads finish the load earlier than others.
[Page 35]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Let's See What's Wrong
Assume 256 threads are scheduled on 8 PEs.
Some elements in the window are not yet loaded by other threads. Error!
[Page 36]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
How To Solve It?
Assume 256 threads are scheduled on 8 PEs.
[Page 37]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
__SYNC();
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Use a "SYNC" barrier
Assume 256 threads are scheduled on 8 PEs.
[Page 38]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
__SYNC();
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Use a "SYNC" barrier
Assume 256 threads are scheduled on 8 PEs.
Wait until all threads hit the barrier.
[Page 39]
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
__SYNC();
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Use a "SYNC" barrier
Assume 256 threads are scheduled on 8 PEs.
All elements in the window are loaded when each thread starts averaging.
[Page 40]
Review What We Have Learned
1. Single Instruction Multiple Thread (SIMT)
2. Shared memory
Q: What are the pros and cons of explicitly managed memory?
Q: What are the fundamental differences between the SIMT and vector SIMD programming models?
[Page 41]
Take the Same Example Again
Average over a 3x3 window for a 16x16 array.
Assume vector SIMD and SIMT both have shared memory. What are the differences?
[Page 42]
int A[16][16];            // A in global memory
__shared__ int B[16][16]; // B in shared mem

for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups xmm0, [ &A[i][j] ]
    movups [ &B[i][j] ], xmm0
  }
}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    addps xmm1, [ &B[i-1][j-1] ]
    addps xmm1, [ &B[i-1][j] ]
    …
    divps xmm1, 9
  }
}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups [ &A[i][j] ], xmm1
  }
}
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
__sync(); // threads wait at barrier
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Vector SIMD vs. SIMT
(1) load to shared mem
(2) compute
(3) store to global mem
[Page 43]
int A[16][16];            // A in global memory
__shared__ int B[16][16]; // B in shared mem

for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups xmm0, [ &A[i][j] ]
    movups [ &B[i][j] ], xmm0
  }
}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    addps xmm1, [ &B[i-1][j-1] ]
    addps xmm1, [ &B[i-1][j] ]
    …
    divps xmm1, 9
  }
}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups [ &A[i][j] ], xmm1
  }
}
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j]; // load to smem
__sync(); // threads wait at barrier
A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
… + smem[i+1][j+1] ) / 9;
}
Vector SIMD vs. SIMT
(a) HW vector width explicit to programmer
(b) HW vector width transparent to programmer
(c) each vector executed by all PEs in lock step
(d) threads executed out of order, need explicit sync
[Page 44]
Review What We Have Learned
Programmers convert data level parallelism (DLP) into thread level parallelism (TLP).
[Page 45]
HW Groups Threads Into Warps
Example: 32 threads per warp
[Page 46]
Execution of Threads and Warps
Let's start with an example.
[Page 47]
Example of Implementation
Note: NVIDIA may use a more complicated implementation. See patent: US 8555035 B1.
[Page 48]
Example of Register Allocation
Assumption:
- register file has 32 lanes
- each warp has 32 threads
- each thread uses 8 registers
Note: NVIDIA may use a more complicated allocation method. See patent: US 7634621 B1, “Register file allocation”.
Acronyms:
“T”: thread number
“R”: register number
[Page 49]
Example
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5

Assume:
- two data paths
- warp 0 on left data path
- warp 1 on right data path

Acronyms:
“AGU”: address generation unit
“r”: register in a thread
“w”: warp number
[Page 50]
Read Src Op 1
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5

Read source operands:
r1 for warp 0
r4 for warp 1
[Page 51]
Buffer Src Op 1
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5

Push ops to operand collector:
r1 for warp 0
r4 for warp 1
[Page 52]
Read Src Op 2
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5

Read source operands:
r2 for warp 0
r5 for warp 1
[Page 53]
Buffer Src Op 2
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5

Push ops to operand collector:
r2 for warp 0
r5 for warp 1
[Page 54]
Execute Stage 1
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5
Compute the first 16 threads in the warp.
[Page 55]
Execute Stage 2
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5
Compute the last 16 threads in the warp.
[Page 56]
Write Back
Address : Instruction
0x0004 : add r0, r1, r2
0x0008 : sub r3, r4, r5

Write back:
r0 for warp 0
r3 for warp 1
[Page 57]
Recap
What we have learned so far:
● Pros and cons of massive threading
● How threads are executed on GPUs
[Page 58]
What Are the Possible Hazards?
Three types of hazards we have learned earlier:
● Structural hazards
● Data hazards
● Control hazards
[Page 59]
Structural Hazards
In practice, structural hazards need to be handled dynamically.
[Page 60]
Data Hazards
In practice, we have a much longer pipeline.
[Page 61]
Control Hazards
We cannot travel back in time. What should we do?
[Page 62]
How Do GPUs Solve Hazards?
Moreover, GPUs (and SIMD processors in general) face an additional challenge: branch divergence.
[Page 63]
Let's Start Again from a Simple CPU
Let's start from MIPS.
[Page 64]
A Naive SIMD Processor
How to handle branch divergence within a vector?
[Page 65]
Handling Branches in GPUs
● Threads within a warp are free to branch.
if( $r17 > $r19 ){
  $r16 = $r20 + $r31
}else{
  $r16 = $r21 - $r32
}
$r18 = $r15 + $r16
The assembly code on the right is disassembled from a CUDA binary (cubin) using "decuda", link
[Page 66]
If No Branch Divergence in a Warp
● If all threads within a warp take the same path, the other path is not executed.
[Page 67]
Branch Divergence within a Warp
● If threads within a warp diverge, both paths have to be executed.
● Masks are set to filter out threads not executing on the current path.
[Page 68]
Dynamic Warp Formation
Example: merge two divergent warps into a new warp if possible.
Create warp 3 from warp 0 and 1
Create warp 4 from warp 0 and 1
ref: "Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware". http://dx.doi.org/10.1145/1543753.1543756
[Page 69]
Single Instruction Multiple Thread
● With no divergent branches, SIMT acts like time-multiplexed interleaved-threading SIMD: a vector of 32 data lanes has a single program state.
● At a divergent branch, SIMT data lanes are assigned program states in a branch stack.
Classification of SIMT by Hennessy and Patterson:
       Static   Dynamic
ILP    VLIW     superscalar
DLP    SIMD     SIMT
[Page 70]
Performance Modeling
What are the bottlenecks?
[Page 71]
Roofline Model
Performance bottleneck: computation bound vs. bandwidth bound.
Note the log-log scale.
[Page 72]
Optimization Is Key for Attainable Gflops/s
[Page 73]
Computation, Bandwidth, Latency
Illustrating three bottlenecks in the Roofline model.
![Page 74: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/74.jpg)
Program Optimization
If I know the bottleneck, how to optimize?
Optimization techniques for NVIDIA GPUs: "Best Practices Guide", http://docs.nvidia.com/cuda/
![Page 75: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/75.jpg)
Recap: What We Have Learned
● GPU architecture (SIMD vs. SIMT)
● GPU programming (vector vs. thread)
● Performance analysis (potential bottlenecks)
![Page 76: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/76.jpg)
Further Reading
GPU architecture and programming:
● NVIDIA Tesla: A unified graphics and computing architecture, IEEE Micro, 2008. http://dx.doi.org/10.1109/MM.2008.31
● Scalable Parallel Programming with CUDA, ACM Queue, 2008. http://dx.doi.org/10.1145/1365490.1365500
● Understanding throughput-oriented architectures, Communications of the ACM, 2010. http://dx.doi.org/10.1145/1839676.1839694
![Page 77: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/77.jpg)
Recommended Reading (Optional)
Performance modeling:
● Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM, 2009. http://dx.doi.org/10.1145/1498765.1498785
Handling branch divergence in SIMT:
● Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, ACM Transactions on Architecture and Code Optimization, 2009. http://dx.doi.org/10.1145/1543753.1543756
![Page 78: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/78.jpg)
Thank You
Questions?
Disclaimer: The opinions expressed in this presentation are solely those of the presenter and do not express the views or opinions of his/her employer.
![Page 79: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/79.jpg)
Backup Slides
![Page 80: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/80.jpg)
Common Optimization Techniques
Optimizations for memory latency tolerance:
● Reduce register pressure
● Reduce shared memory pressure
Optimizations for memory bandwidth:
● Global memory coalescing
● Avoid shared memory bank conflicts
● Group byte accesses
● Avoid partition camping
Optimizations for computation efficiency:
● Mul/Add balancing
● Increase the floating-point proportion
Optimizations for operational intensity:
● Use tiled algorithms
● Tune thread granularity
![Page 81: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/81.jpg)
![Page 82: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/82.jpg)
Multiple Levels of Memory Hierarchy

| Name | Cached? | Latency (cycles) | Read-only? |
|---|---|---|---|
| Global | L1/L2 | 200–400 (cache miss) | R/W |
| Shared | No | 1–3 | R/W |
| Constant | Yes | 1–3 | Read-only |
| Texture | Yes | ~100 | Read-only |
| Local | L1/L2 | 200–400 (cache miss) | R/W |
![Page 83: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/83.jpg)
Shared Memory Contains Multiple Banks
![Page 84: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/84.jpg)
Compute Capability
Need architecture info to perform optimization.
ref: NVIDIA, "CUDA C Programming Guide", (link)
![Page 85: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/85.jpg)
Shared Memory (compute capability 2.x)
without bank conflict:
with bank conflict:
![Page 86: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/86.jpg)
![Page 87: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/87.jpg)
Global Memory in Off-Chip DRAM
Address space is interleaved among multiple channels.
![Page 88: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/88.jpg)
Global Memory
![Page 89: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/89.jpg)
Global Memory
![Page 90: Graphics Processing Unit Zhenyu Ye Henk Corporaal 5SIA0, TU/e, 2015 Background image: Nvidia Titan X GPU with Maxwell architecture](https://reader035.vdocuments.net/reader035/viewer/2022062409/5697bff91a28abf838cbfaf0/html5/thumbnails/90.jpg)
Global Memory