Lecture 5: HW1 Discussion, Intro to GPUs
G63.2011.002/G22.2945.001 · October 5, 2010
Outline
Discuss HW1
Intro to GPU Computing
Dense Matrix Multiply: Blocking vs Scalar
We provided a blocked example matrix multiplication code. Why is blocked matmul faster than un-blocked?

Key: Computational Intensity

Definition: flops per FPN (floating-point number) moved up the memory hierarchy.

Large intensity: good for deep memory hierarchies.
Computational Intensity for Scalar Matmul
Floating-point operations: $2N^3$

Assume: $\mathrm{size}(\mathrm{L1}) \ll N^2$ FPNs. Then:

$N^2$: read each row of A once
$+\ N^3$: read each column of B, $N$ times
$+\ 2N^2$: read/write C

$\Rightarrow\ N^3 + 3N^2$ FPN-size cache misses (neglecting cache lines, etc.)

Computational intensity: $2N^3/(N^3 + 3N^2) \approx 2$
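For reference, a minimal C sketch of the un-blocked triple loop this analysis describes (illustrative; assumes square N × N row-major matrices of doubles):

    /* scalar (un-blocked) matmul: C += A*B, all N x N, row-major */
    void matmul_scalar(int N, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
            {
                double sum = 0;
                for (int k = 0; k < N; ++k)          /* 2N^3 flops in total */
                    sum += A[i*N + k] * B[k*N + j];  /* B walked with stride N */
                C[i*N + j] += sum;
            }
    }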
Computational Intensity for Blocked Matmul

Floating-point operations: still $2N^3$

$b$: block size, $n := \lceil N/b \rceil$

$b^2 n^3$: read blocks of A ($n^3$ block-reads of $b^2$ entries each)
$+\ b^2 n^3$: same for B
$+\ 2N^2$: read/write C

$\Rightarrow\ 2b^2 n^3 + 2N^2$ FPN-size cache misses

Rewrite: $b^2 n^3 \approx b^2 \cdot N^3/b^3 = N^3/b$

Computational intensity:
$\frac{2N^3}{2N^3/b + 2N^2} \approx \frac{2N^3}{2N^3/b} = b$

→ incentive to choose $b$ large

The power of assumptions: can we choose $b = N$?
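A minimal C sketch of the corresponding blocked loop nest (illustrative; assumes b divides N and row-major storage):

    /* blocked matmul: C += A*B, N x N row-major, block size b */
    void matmul_blocked(int N, int b, const double *A, const double *B,
                        double *C)
    {
        for (int ib = 0; ib < N; ib += b)
            for (int jb = 0; jb < N; jb += b)
                for (int kb = 0; kb < N; kb += b)
                    /* multiply block A[ib,kb] by block B[kb,jb] into C[ib,jb];
                       all three b x b blocks can stay cache-resident */
                    for (int i = ib; i < ib + b; ++i)
                        for (int j = jb; j < jb + b; ++j)
                        {
                            double sum = C[i*N + j];
                            for (int k = kb; k < kb + b; ++k)
                                sum += A[i*N + k] * B[k*N + j];
                            C[i*N + j] = sum;
                        }
    }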
Hatching a Plan
Consider each level of the memory hierarchy.
How do we exploit . . .

• . . . L2: Ignore; we're nearly L2-local at most sizes.
• . . . L1: 32 KiB = 4096 doubles. Key: memory layout.
• . . . registers: 16 FP registers. Key: loop/operation ordering.
Optimizing for L1: Memory Layout
Memory layout of A: column-major.

Only use one entry of each cache line per fetch.

Better to store A in row-major order.

Input is row-major. If memory is available (not swap!), storing a transposed copy of A can be a good idea. (The copy takes O(N²) time.)
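A sketch of that copy (illustrative; O(N²) extra time and memory):

    #include <stdlib.h>

    /* return a transposed copy of the row-major N x N matrix A,
       so the kernel can traverse the former columns with unit stride */
    double *transpose_copy(int N, const double *A)
    {
        double *At = malloc((size_t) N * N * sizeof *At);
        if (!At)
            abort();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                At[j*N + i] = A[i*N + j];
        return At;
    }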
Optimizing for L1: Reuse Pattern, Block Size
Question: Blocking: good idea. Optimal $b_{L1}$?

Follow-up question: How much needs to fit in L1?

First answer: one block of each of A, B, C.
Better answer: all of the A block, plus one column each of B and C.

32 KiB: $8 b_{L1}^2 + 2 \cdot 8\, b_{L1} \le 32768 \;\rightarrow\; b_{L1} \le 60$
L1 Block Copy
Further concerns:

• Cache line boundaries
• SIMD
• Cache set conflicts

All solved by the small-block copy optimization.

Not: copy all of A. Instead: copy $b_{L1}$-sized blocks of A, B, and C, operate on those, then copy the output back.
L1 Block Copy: The Plan
Basic plan (C sketch below):

For each i:
    For each j:
        Load block C[i,j]
        For each k:
            Load block A[i,k]
            Load block B[k,j]
            $\lceil b_{L1}/b_r \rceil^3$ register kernels: C += A·B
        Store block C[i,j]

(Can be improved: many A, B loads.)

Aside: also neatly deals with fringes.

So: how does this solve the problems above? Can you define "alignment"?
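The plan above in C (a sketch; copy_in, copy_out and block_kernel are hypothetical helpers, and b is assumed to divide N):

    /* hypothetical helpers: move a b x b block between the big row-major
       matrix and a small contiguous, aligned buffer */
    void copy_in(int N, int b, const double *M, int ib, int jb, double *buf);
    void copy_out(int N, int b, double *M, int ib, int jb, const double *buf);
    /* Cb += Ab * Bb on the copied blocks, via the register kernel */
    void block_kernel(int b, const double *Ab, const double *Bb, double *Cb);

    void matmul_block_copy(int N, int b, const double *A, const double *B,
                           double *C, double *Ab, double *Bb, double *Cb)
    {
        for (int i = 0; i < N; i += b)
            for (int j = 0; j < N; j += b)
            {
                copy_in(N, b, C, i, j, Cb);          /* load block C[i,j] */
                for (int k = 0; k < N; k += b)
                {
                    copy_in(N, b, A, i, k, Ab);      /* load block A[i,k] */
                    copy_in(N, b, B, k, j, Bb);      /* load block B[k,j] */
                    block_kernel(b, Ab, Bb, Cb);
                }
                copy_out(N, b, C, i, j, Cb);         /* store block C[i,j] */
            }
    }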
Alignment
A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes. (See also the IBM developerWorks article.)

    #include <stdlib.h>

    /* dynamic allocation (array_size: size in bytes) */
    double *var;
    int error = posix_memalign((void **) &var, 64, array_size);
    if (error)
        abort();

    /* static allocation */
    double __attribute__((aligned(64))) ary2[500];

Examples: cache-line-aligned, SIMD-aligned.

Code generation in the non-aligned case?
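A quick run-time check of this definition (a sketch, using the standard uintptr_t conversion):

    #include <stdint.h>

    /* nonzero iff p is n-byte aligned, n a power of two */
    int is_aligned(const void *p, uintptr_t n)
    {
        return ((uintptr_t) p & (n - 1)) == 0;
    }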
Register Kernel

Choose block size $b_r = 2^k$, with $b_{L1} \bmod b_r = 0$.

    for (int j = 0; j < b_r; ++j)
        for (int k = 0; k < b_r; ++k)
            for (int i = 0; i < b_r; ++i)
                C[i + j*b_l1] += A[i + k*b_l1] * B[k + j*b_l1];

For each A·b matvec (b: one column of the B block): perform $b_r$ scalar·vector updates.

• Vectorizable
• Pipeline-friendly (minimal data dependencies)
• Access to A, C is unit-stride
• Access to B is inner-loop invariant
• Unrolling, software pipelining: the compiler's job
Psychoanalyzing the Compiler
Flags for Intel:
-O3 -fno-alias -funroll-loops
-std=c99 -D_XOPEN_SOURCE=500
-opt-streaming-stores auto -static
-fast -xHost

Flags for GCC:
-O3 -funroll-loops -march=native
-std=c99 -D_XOPEN_SOURCE=500
-ftree-vectorizer-verbose=2
-ffast-math

GCC 4.3 is sometimes better than GCC 4.4.

Self-study material:
• Compiler references: Intel, GNU
• C99 restrict keyword, aliasing (sketch below)
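A small illustration of restrict (a sketch: the qualifier is a promise that dst and src do not alias, which lets the compiler reorder and vectorize without run-time overlap checks):

    /* without restrict, the compiler must assume dst and src
       may overlap and is limited in reordering/vectorizing the loop */
    void axpy(int n, double a, double *restrict dst, const double *restrict src)
    {
        for (int i = 0; i < n; ++i)
            dst[i] += a * src[i];
    }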
Profiling
OProfile: a sampling profiler. Uses performance counters. Linux only, needs root.

Many event types countable:

CPU_CLK_UNHALTED : clock cycles when not halted
L2_RQSTS : number of L2 cache requests
LLC_MISSES : L2 cache demand requests from this core that missed the L2
FLOPS : number of FP computational micro-ops executed
IDLE_DURING_DIV : cycles divider is busy and all other execution units are idle
L1D_ALL_REF : all references to the L1 data cache
L1D_PEND_MISS : total number of outstanding L1 data cache misses at any cycle
IFU_MEM_STALL : cycles instruction fetch pipe is stalled
INST_RETIRED : number of instructions retired
UOPS_RETIRED : number of uops retired
MACHINE_NUKES_SMC : number of pipeline flushing events
RAT_STALLS : partial register stall cycles
BR_INST_DECODED : number of branch instructions decoded

Sample annotated output (FLOPS and L1D_PEND_MISS sample counts per instruction):

      FLOPS          L1D_PEND_MISS
          8 2.6e-04     18 0.7037    movsd 0x50(%rax),%xmm7
        187 0.0062       8 0.3127    movsd 0x58(%rax),%xmm5
          7 2.3e-04     24 0.9382    movsd 0x60(%rax),%xmm3
        470 0.0155      18 0.7037    movsd 0x68(%rax),%xmm4
         49 0.0016       9 0.3518    movsd 0x70(%rax),%xmm2
       2873 0.0950       7 0.2737    movsd 0x78(%rax),%xmm1
        434 0.0144       8 0.3127    xchg  %ax,%ax
     184312 6.0959      26 1.0164    movsd (%rdx),%xmm0
       2022 0.0669      14 0.5473    inc   %esi
         19 6.3e-04      3 0.1173    mulsd (%rcx),%xmm0
       5294 0.1751     189 7.3886    addsd 0x30(%rsp),%xmm0
      31888 1.0547      68 2.6583    movsd %xmm0,(%rax)
      66032 2.1839      37 1.4464    movsd %xmm0,0x30(%rsp)
     114001 3.7704      43 1.6810    movsd (%rcx),%xmm0
       1131 0.0374       3 0.1173    mulsd 0x8(%rdx),%xmm0
      11913 0.3940       2 0.0782    addsd %xmm0,%xmm14
      94565 3.1276      20 0.7819    movsd %xmm14,0x8(%rax)
     108501 3.5885      25 0.9773    movsd (%rcx),%xmm0
          4 1.3e-04      1 0.0391    mulsd 0x10(%rdx),%xmm0
      76622 2.5342      81 3.1665    addsd %xmm0,%xmm15
      82075 2.7145      42 1.6419    movsd %xmm15,0x10(%rax)
     119036 3.9370      36 1.4073    movsd (%rcx),%xmm0
          5 1.7e-04      0 0         mulsd 0x18(%rdx),%xmm0
       2700 0.0893       0 0         addsd %xmm0,%xmm12
      14861 0.4915      11 0.4300    movsd %xmm12,0x18(%rax)
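Typical usage with the legacy opcontrol interface (a sketch; event names and sample counts are illustrative and machine-specific):

    opcontrol --setup --event=FLOPS:100000 --event=L1D_PEND_MISS:100000
    opcontrol --start
    ./matmul                        # run the workload
    opcontrol --stop
    opreport -l ./matmul            # per-symbol summary
    opannotate --assembly ./matmul  # per-instruction counts, as above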
Solution Performance
[Plot: MFlops/s vs. matrix dimension N (0 to 800) for three versions: basic, tuned, blas.]

git clone ssh://[email protected]:2234/hw1-solution.git
(Private; works if you signed up for an account.)

Great, but: most BLAS implementations lose out to triple loops for special-case matrices.

Want to see the code of a "real" BLAS? GotoBLAS2
Key Messages of HW1
In HPC:

• Very simple things quickly become rather complex.
• Need: ideas, careful analysis.
• Flexibility ↔ performance
• Run-time code generation can be useful.

This class helps by introducing
• known tricks,
• helpful tools.

Matmul is a "microcosm" of single-processor optimization.

Do not worry if you did not figure out the tricks here on your own.
Questions?
?
Outline
Discuss HW1
Intro to GPU Computing
GPUs: System Context
[Annotated mainboard photo: processor, memory, expansion slots (PCI Express x4, x16, x1, x16, and regular PCI). The GPU goes into a PCIe v2 x16 slot; bandwidth ∼6 GB/s.]
GPU Computing?
• Design target for CPUs:
  • Make a single thread very fast
  • Take control away from the programmer

• GPU computing takes a different approach:
  • Throughput matters; single threads do not
  • Give explicit control to the programmer
“CPU-style” Cores
[Figure: a "CPU-style" core: Fetch/Decode, ALU (execute), execution context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.]

Credit: Kayvon Fatahalian (Stanford), SIGGRAPH 2009, "Beyond Programmable Shading": http://s09.idav.ucdavis.edu/
Slimming down
Idea #1: Remove components that help a single instruction stream run fast.

[Figure: the same core slimmed down to just Fetch/Decode, ALU (execute), and execution context.]

Credit: Kayvon Fatahalian (Stanford)
More Space: Double the Number of Cores
[Figure: two cores process two fragments in parallel; each core (Fetch/Decode, ALU, execution context) runs its own instance of the same shader:]

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul   r3, v0, cb0[0]
    madd  r3, v1, cb0[1], r3
    madd  r3, v2, cb0[2], r3
    clmp  r3, r3, l(0.0), l(1.0)
    mul   o0, r0, r3
    mul   o1, r1, r3
    mul   o2, r2, r3
    mov   o3, l(1.0)

Credit: Kayvon Fatahalian (Stanford)
. . . again
[Figure: four cores, four fragments in parallel.]

Credit: Kayvon Fatahalian (Stanford)
. . . and again
[Figure: sixteen cores, sixteen fragments in parallel. 16 cores = 16 simultaneous instruction streams.]

Credit: Kayvon Fatahalian (Stanford)

→ 16 independent instruction streams

Reality: the instruction streams are not actually very different/independent.
Saving Yet More Space
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs.

→ SIMD

[Figure: the simple core again, then with added ALUs: one Fetch/Decode unit feeds ALU 1 through ALU 8, with eight contexts (Ctx) and shared Ctx data: SIMD processing.]

Credit: Kayvon Fatahalian (Stanford)
Gratuitous Amounts of Parallelism!
[Figure: 16 cores × 8 ALUs = 128 fragments in parallel, 16 simultaneous instruction streams.]

Credit: Kayvon Fatahalian (Stanford)

Example: 128 instruction streams in parallel, as 16 independent groups of 8 synchronized streams.

Great if everybody in a group does the same thing.

But what if not?

What leads to divergent instruction streams?
Branches
[Figure: time (in clocks) on one SIMD core, ALU 1 through ALU 8, executing:]

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

[The per-lane condition evaluates to T T T F F F F F: the three "T" lanes execute the then-branch while the others idle, then the five "F" lanes execute the else-branch.]

Not all ALUs do useful work! Worst case: 1/8 performance.

Credit: Kayvon Fatahalian (Stanford)
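Conceptually, the hardware turns the branch into masked execution. A scalar C sketch of one 8-wide SIMD group (a conceptual model only; e, Ks, Ka mirror the shader above):

    #include <math.h>

    /* both sides of the branch are issued to all 8 lanes;
       a per-lane mask decides whose results are kept */
    void branch_masked(const double x[8], double e, double Ks, double Ka,
                       double refl[8])
    {
        int mask[8];
        for (int lane = 0; lane < 8; ++lane)
            mask[lane] = (x[lane] > 0);
        for (int lane = 0; lane < 8; ++lane)   /* "then" side */
            if (mask[lane])
                refl[lane] = Ks * pow(x[lane], e) + Ka;
        for (int lane = 0; lane < 8; ++lane)   /* "else" side */
            if (!mask[lane])
                refl[lane] = Ka;
        /* wall-clock cost = then-side + else-side, regardless of the mask:
           worst case, 1/8 of the work is useful */
    }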
Remaining Problem: Slow Memory
Problem: memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that.

We've removed:

• caches
• branch prediction
• out-of-order execution

So what now?

Idea #3: Even more parallelism + some extra memory = a solution!
Hiding Memory Latency
[Figure sequence: a group of fragments (Frag 1 … 8) stalls on memory; the core keeps four groups of contexts (Frag 1 … 8, 9 … 16, 17 … 24, 25 … 32) and switches to a runnable group whenever the current one stalls, resuming each group when its data arrives. Trade-off: increased run time of one group for maximum throughput of many groups. Throughput!]

Credit: Kayvon Fatahalian (Stanford)
GPU Architecture Summary
Core ideas:

1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer control units
3. Avoid memory stalls by interleaving execution of SIMD groups

Credit: Kayvon Fatahalian (Stanford)
GPU-CPU Bird’s Eye Comparison
Floorplan: VIA Isaiah (2008). 65 nm, 4 SP ops at a time, 1 MiB L2.

Floorplan: AMD RV770 (2008). 55 nm, 800 SP ops at a time.
Nvidia GTX200
[Die diagram: an array of identical cores, each with a Fetch/Decode unit, 8 ALUs, a DP ALU (4×), 32 KiB Ctx Private, and 16 KiB Ctx Shared, all attached to off-chip memory at 150 GB/s.]
GPU Architecture (e.g. Nvidia GT200)
• 1 GPU = 30 SIMD cores
• 1 SIMD core: 32 × 32 PCs, HW sched + 1 ID (1/4 clock) + 8 SP + 1 DP + 16 KiB shared + 32 KiB reg
• Device ↔ RAM: 140 GB/s
• Device ↔ Host: 6 GB/s
• User manages the memory hierarchy
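A quick consequence of these numbers (back-of-the-envelope, assuming double precision): a $4096 \times 4096$ matrix occupies $4096^2 \cdot 8\,\mathrm{B} \approx 134\,\mathrm{MB}$. Streaming it from device RAM at 140 GB/s takes about 1 ms, while moving it across PCIe at 6 GB/s takes about 22 ms, so data should stay on the device across kernel launches whenever possible.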
What is OpenCL?
"OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors." [OpenCL 1.1 spec]

• Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
• Vendor-neutral
• Comes with RTCG (run-time code generation)

Defines:

• Host-side programming interface (library)
• Device-side programming language (!)
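A minimal host-side sketch showing the shape of the API (illustrative; the kernel source and buffer size are made up, and error codes are left unchecked for brevity, which real code must not do):

    #include <CL/cl.h>

    const char *src =
        "__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2; }";

    int main(void)
    {
        cl_platform_id platform; cl_device_id dev; cl_int err;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

        /* device-side language: compiled from source at run time (RTCG) */
        cl_program prg = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prg, 1, &dev, NULL, NULL, NULL);
        cl_kernel knl = clCreateKernel(prg, "scale", &err);

        float host_buf[1024] = {0};
        cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_COPY_HOST_PTR,
                                        sizeof host_buf, host_buf, &err);
        clSetKernelArg(knl, 0, sizeof dev_buf, &dev_buf);
        size_t gsize = 1024;
        clEnqueueNDRangeKernel(queue, knl, 1, NULL, &gsize, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, dev_buf, CL_TRUE, 0,
                            sizeof host_buf, host_buf, 0, NULL, NULL);
        return 0;
    }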
Questions?
?
Image Credits
• Blocks: sxc.hu/Avolore
• Flag: sxc.hu/Ambrozjo
• Mainboard: Wikimedia Commons
• PCI Express slots: Wikimedia Commons
• Fighting chips: flickr.com/oskay
• Isaiah die shot: VIA Technologies
• RV770 die shot: AMD Corp.
• Nvidia Tesla architecture: Nvidia Corp.