Parallel computing and GPU introduction · 2013-12-17 · 黃子桓 ‹tzhuan@gmail.com›
TRANSCRIPT
Agenda
• Parallel computing
• GPU introduction
• Interconnection networks
• Parallel benchmark
• Parallel programming
• GPU programming
國立台灣大學 National Taiwan University
Parallel computing
Goal of computing
Faster, faster and faster
Why parallel computing?
• Moore's law is dead (for CPU frequency)
Top500 (Nov 2013)
1. Tianhe-2 (NUDT): 3,120,000 cores (Intel Xeon E5, Intel Xeon Phi)
2. Titan (Cray): 560,640 cores (Opteron 6274, NVIDIA K20x)
3. Sequoia (IBM): 1,572,864 cores (Power BQC)
4. K computer (Fujitsu): 705,024 cores (SPARC64)
5. Mira (IBM): 786,432 cores (Power BQC)
Amdahl's law
[figure: a fixed serial portion plus a parallelizable portion, shown for 2 processors, 4 processors, and many processors; the serial portion limits the total speedup]
Amdahl's law

Total speedup = 1 / ((1 − P) + P / S)

P: fraction of the work that is parallelizable
S: speedup of the parallelizable work

Example: P = 0.8 (80% of the work is parallelizable), S = 8 (8 processors)
Total speedup = 1 / (0.2 + 0.1) = 3.33×
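The formula can be checked numerically; a minimal sketch in C (the function name is mine):

```c
#include <assert.h>

/* Amdahl's law: total speedup when a fraction P of the work is
 * parallelizable and that part is sped up by a factor S. */
double amdahl_speedup(double P, double S) {
    return 1.0 / ((1.0 - P) + P / S);
}
```

With P = 0.8 and S = 8 this gives about 3.33, matching the slide; note that even as S grows without bound, the speedup is capped at 1 / (1 − P) = 5.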
Amdahl's law
Scaling example
• Workload: sum of 10 scalars, plus a 10×10 matrix sum
• Single processor: Time = (10 + 100) × Tadd = 110 × Tadd
• 10 processors: Time = 10 × Tadd + (100/10) × Tadd = 20 × Tadd
  Speedup = 110/20 = 5.5× (55% of potential)
• 100 processors: Time = 10 × Tadd + (100/100) × Tadd = 11 × Tadd
  Speedup = 110/11 = 10× (10% of potential)
Scaling example
• What if the matrix size is 100×100?
• Single processor: Time = (10 + 10000) × Tadd = 10010 × Tadd
• 10 processors: Time = 10 × Tadd + (10000/10) × Tadd = 1010 × Tadd
  Speedup = 10010/1010 = 9.9× (99% of potential)
• 100 processors: Time = 10 × Tadd + (10000/100) × Tadd = 110 × Tadd
  Speedup = 10010/110 = 91× (91% of potential)
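Both scaling examples follow the same time model; a small sketch (helper names are mine, time in units of Tadd):

```c
#include <assert.h>

/* Time, in units of Tadd, for the slides' workload: 10 serial scalar
 * adds plus a matrix sum of `elems` elements split across `p`
 * processors. */
double time_units(int elems, int p) {
    return 10.0 + (double)elems / p;
}

/* Speedup relative to a single processor. */
double speedup(int elems, int p) {
    return time_units(elems, 1) / time_units(elems, p);
}
```

Growing the matrix from 10×10 to 100×100 shrinks the serial fraction, which is why the larger problem scales so much better.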
Scalability
• The ability of a system to handle a growing amount of work
• Strong scaling: fixed total problem size; run a fixed problem faster
• Weak scaling: fixed problem size per processor; run a bigger (or smaller) problem
Parallel computing system
• Parallelization design for processors
• Hardware multithreading
• Multi-processor system
• Cluster computing system
• Grid computing system
Parallelization design for processors
• Instruction-level parallelism
  add $t0, $t1, $t2
  mul $t3, $t4, $t5
• Data-level parallelism
  add 0($t1), 0($t2), 0($t3)
  add 4($t1), 4($t2), 4($t3)
  add 8($t1), 8($t2), 8($t3)
Flynn's taxonomy

                Single instruction                     Multiple instruction
Single data     SISD (single-core processor)           MISD (very rare)
Multiple data   SIMD (superscalar, vector processor,   MIMD (multi-core processor)
                GPU, etc.)
SIMD
• Operates element-wise on vectors of data
  MMX and SSE instructions in x86; multiple data elements in 128-bit wide registers
• All processors execute the same instruction at the same time, each with a different data address
• Simplifies synchronization
• Reduces instruction control hardware
• Works best for highly data-parallel applications
Example: dot product
mov esi, dword ptr [src]
mov edi, dword ptr [dst]
mov ecx, Count
start:
movaps xmm0, [esi] //a3, a2, a1, a0
mulps xmm0, [esi + 16] //a3*b3,a2*b2,a1*b1,a0*b0
haddps xmm0, xmm0 //a3*b3+a2*b2,a1*b1+a0*b0,
//a3*b3+a2*b2,a1*b1+a0*b0
movaps xmm1, xmm0
psrldq xmm0, 8
addss xmm0, xmm1
movss [edi],xmm0
add esi, 32
add edi, 4
sub ecx, 1
jnz start
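A plain-C reference for what this loop computes may help: each iteration reads four floats of a followed by four floats of b from src and writes one dot product to dst (the function name is mine; the SSE version above does the same arithmetic four multiplies at a time):

```c
#include <assert.h>

/* Scalar reference for the SSE dot-product loop: for each of `count`
 * iterations, src holds a0..a3 followed by b0..b3, and the 4-element
 * dot product a·b is written to dst[i]. */
void dot4_ref(const float *src, float *dst, int count) {
    for (int i = 0; i < count; ++i) {
        float sum = 0.0f;
        for (int k = 0; k < 4; ++k)
            sum += src[k] * src[k + 4];   /* a[k] * b[k] */
        dst[i] = sum;
        src += 8;                          /* matches: add esi, 32 */
    }
}
```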
Vector processors
• Highly pipelined functional units
• Stream data from/to vector registers to the units
  Data is collected from memory into registers; results are stored from registers back to memory
• Example: vector extension to MIPS
  32 × 64-element registers (64-bit elements)
  Vector instructions:
  • lv, sv: load/store vector
  • addv.d: add vectors of double
  • addvs.d: add scalar to each element of a vector of double
• Significantly reduces instruction-fetch bandwidth
Example: DAXPY (Y = a × X + Y)
• Conventional MIPS code
      l.d    $f0,a($sp)      ;load scalar a
      addiu  r4,$s0,#512     ;upper bound of what to load
loop: l.d    $f2,0($s0)      ;load x(i)
      mul.d  $f2,$f2,$f0     ;a × x(i)
      l.d    $f4,0($s1)      ;load y(i)
      add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d    $f4,0($s1)      ;store into y(i)
      addiu  $s0,$s0,#8      ;increment index to x
      addiu  $s1,$s1,#8      ;increment index to y
      subu   $t0,r4,$s0      ;compute bound
      bne    $t0,$zero,loop  ;check if done
• Vector MIPS code
      l.d     $f0,a($sp)     ;load scalar a
      lv      $v1,0($s0)     ;load vector x
      mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
      lv      $v3,0($s1)     ;load vector y
      addv.d  $v4,$v2,$v3    ;add y to product
      sv      $v4,0($s1)     ;store the result
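Both listings compile from the same C loop; a sketch (the slide's bound of #512 bytes corresponds to n = 64 doubles):

```c
#include <assert.h>

/* DAXPY: y(i) = a * x(i) + y(i) for i = 0..n-1. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The vector version replaces the whole scalar loop with six instructions, which is where the instruction-fetch savings come from.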
Hardware multithreading
• Allows multiple threads to share the functional units of a single processor in an overlapping fashion
• Coarse-grained multithreading: switches threads only on costly stalls
• Fine-grained multithreading: interleaved execution of multiple threads
• Simultaneous multithreading: a multiple-issue, dynamically scheduled processor exploits thread-level parallelism
Hardware multithreading
Multi-processor system
• Shared-memory multi-processor
Multi-processor system
• Non-uniform memory access (NUMA) multi-processor
Cluster computing system
Grid computing system
GPU introduction
Computer graphics rendering
History of computer graphics
• 1963, Ivan Sutherland's Sketchpad: the beginning of computer graphics
• 1992, OpenGL 1.0
• 1996, Voodoo I: the first consumer 3D graphics card
• 1996, DirectX 3.0: the first version including Direct3D
The history of computer graphics
• 2000, DirectX 8.0: the first version supporting programmable shaders
• 2001, GeForce 3 (NV20): the first programmable consumer GPU
• 2004, OpenGL 2.0: the first version supporting GLSL
• 2006, GeForce 8 (G80): the first NVIDIA GPU supporting CUDA
• 2008, OpenCL (Apple, AMD, IBM, Qualcomm, Intel, …)
Graphics pipeline
• Input processor: receives instructions, states, and data
• Geometry stage: transforms, lighting, etc.
• Pixel stage: rasterization, pixel shading
• Output stage: accumulates pixel results (Z-buffer, transparency)
Make it faster
Replicate the pipeline: one input processor feeds several geometry/pixel stages running in parallel, and their pixel results are accumulated.
Add framebuffer support
The accumulated pixel results are written to a framebuffer (FB) in memory shared by the parallel pipelines.
Add programmability
Each unit now follows a get data → process data → output data pattern: a front end feeds programmable geometry-shader ALUs and pixel-shader ALUs, and raster operations write the results to the framebuffer (memory).
Unified shader
The separate geometry- and pixel-shader ALUs merge into a single pool of unified shader ALUs; a buffer routes work between the front end, raster operations, and the framebuffer (memory).
Scaling it up again
Add more unified shader ALUs behind the same front end, raster operations, buffer, and framebuffer (memory).
NVIDIA Tesla GPU
[figure: an array of processors connected through a sorting/distribution network to memory and I/O]
Interconnection networks
Interconnection networks
• Performance metrics
  Network bandwidth: the peak transfer rate of the network (best case)
  Bisection bandwidth: the bandwidth between two equal halves of the multiprocessor (worst case)
Network topologies
• Bus
• Ring
• 2D mesh
• N-cube
• Fully connected
Multistage networks
Crossbar Omega network
Network characteristics
• Performance
  Latency
  Throughput
  • Link bandwidth
  • Total network bandwidth
  • Bisection bandwidth
  Congestion delays
• Cost
• Power
• Routability in silicon
Parallel benchmark
Parallel benchmark
• Linpack: matrix linear algebra
• SPECrate: parallel runs of SPEC CPU programs (job-level parallelism)
• SPLASH (Stanford Parallel Applications for Shared Memory): a mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite: computational fluid dynamics kernels
• PARSEC (Princeton Application Repository for Shared-Memory Computers) suite: multithreaded applications using Pthreads and OpenMP
Modeling performance
• What is the performance metric of interest?
  Attainable GFLOPs/second, measured using computational kernels from the Berkeley design patterns
• Arithmetic intensity: FLOPs per byte of memory accessed
• For a given computer, determine
  Peak FLOPs/second
  Peak memory bytes/second
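The roofline model combines these two peaks: attainable throughput is capped by the compute peak at high arithmetic intensity and by the memory-bandwidth ceiling at low intensity. A sketch (illustrative numbers only):

```c
#include <assert.h>

/* Roofline model: attainable GFLOP/s is the lesser of the compute peak
 * and the memory-bandwidth ceiling (peak GB/s × arithmetic intensity,
 * in FLOPs per byte). */
double attainable_gflops(double peak_gflops, double peak_gbs,
                         double intensity) {
    double memory_bound = peak_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For a hypothetical machine with a 16 GFLOP/s compute peak and 8 GB/s of memory bandwidth, a kernel at 1 FLOP/byte is memory-bound at 8 GFLOP/s, while one at 4 FLOPs/byte reaches the full 16.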
Roofline diagram
Comparing systems
• Opteron X2 vs. X4
  2 cores vs. 4 cores
  2.2 GHz vs. 2.3 GHz
  Same memory system
Optimizing performance
Optimizing performance
• The choice of optimization depends on the arithmetic intensity of the code
Four example systems
• Intel Xeon E5345
• AMD Opteron X4 2356
• Sun UltraSPARC T2 5140
• IBM Cell QS20
Roofline diagrams
Conclusion
• Goal: higher performance by using multiple processors
• Difficulties
  Developing parallel software
  Devising appropriate architectures
• Many reasons for optimism
  Changing software and application environment
  Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
• An ongoing challenge for computer architects!
Parallel programming
Can a program be parallelized?
• Matrix multiplication: yes; every C[i][j] can be computed independently
for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
            C[i][j] += A[i][k] * B[k][j];   // note +=, with C zero-initialized
• Fibonacci sequence: no; each step depends on the previous one
A = 0, B = 1
for (int i = 0; i < N; ++i) {
    C = A + B;
    A = B;
    B = C;
}
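Because every entry of the result matrix is independent, the matrix loop parallelizes directly; a sketch with OpenMP (fixed small sizes for illustration; build with -fopenmp, otherwise the pragma is ignored and the loop runs serially):

```c
#include <assert.h>

enum { M2 = 2, N2 = 2, K2 = 2 };   /* small illustrative sizes */

/* Each (i, j) entry is computed independently, so the outer loop can be
 * split across threads with no coordination between them. */
void matmul(const double A[M2][K2], const double B[K2][N2],
            double C[M2][N2]) {
    #pragma omp parallel for
    for (int i = 0; i < M2; ++i)
        for (int j = 0; j < N2; ++j) {
            double sum = 0.0;
            for (int k = 0; k < K2; ++k)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```

The Fibonacci loop has no such decomposition: C depends on the A and B produced by the previous iteration, a serial dependence chain.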
Parallel programming
• Software/algorithm is the key
• Aim for a significant performance improvement; otherwise, just use a faster uniprocessor
• Difficulties
  Partitioning
  Coordination
  Communication overhead
Parallel programming
• Job-level parallelism
• Single program running on multiple processors
• Single program running on multiple computers
Job-level parallelism
• The operating system already does it
• How to improve throughput?
  Number of processors vs. number of jobs
  Memory usage
  I/O statistics
  Scheduling and priority
  ...
Multi-process program
• Process: a running instance of a program
• Example: a WWW server that spawns one process per browser connection
Multi-thread program
• Thread: a lightweight process
• Example: a WWW server that spawns one thread per browser connection
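A minimal sketch of this one-thread-per-request pattern using POSIX threads (the names are illustrative, not from the slides; link with -pthread):

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Each request is handled by its own thread; all threads share the
 * server's address space. */
static void *handle_request(void *arg) {
    int id = *(int *)arg;
    printf("handling request %d\n", id);
    return NULL;
}

/* Spawn one thread per request and wait for all of them to finish.
 * Returns 0 on success, -1 on failure. */
int serve(int nrequests) {
    pthread_t tid[16];
    int ids[16];
    if (nrequests > 16)
        return -1;
    for (int i = 0; i < nrequests; ++i) {
        ids[i] = i;
        if (pthread_create(&tid[i], NULL, handle_request, &ids[i]) != 0)
            return -1;
    }
    for (int i = 0; i < nrequests; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
```

A multi-process server would call fork() instead, giving each connection its own address space at the cost of heavier launch and communication.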
Multi-process vs. multi-thread
• Performance: launch time, context switch, kernel-aware scheduling
• Communication: inter-process vs. inter-thread communication
• Stability
Message passing interface (MPI)
[figure: nodes exchanging messages; in a broadcast, one node sends to all the others]
MapReduce/Hadoop
GPU programming
GPGPU
CUDA
Thanks!