CUDA GPU Computing
1
CUDA GPU Computing
Advisor: Cho-Chin Lin
Student: Chien-Chen Lai
2
Outline
Introduction and Motivation
3
What is driving the many-cores?
[Figure: GFLOPS (0 to 600) versus time, Jan 2003 to Jul 2007. GPUs: NV30, NV35, NV40, G70, G70-512, G71, GeForce 8800 GTX, Quadro FX 5600, Tesla C870. CPUs: 3.0 GHz Pentium 4, 3.0 GHz Core 2 Duo, 3.0 GHz Core 2 Quad.]
4
Design philosophies are different.
[Figure: block diagrams of a CPU and a GPU, each showing Control, Cache, ALUs, and DRAM]
The GPU is specialized for compute-intensive, massively data-parallel computation, which is exactly what graphics rendering is about.
As a result, more transistors can be devoted to data processing rather than to data caching and flow control.
5
6
CPU vs. GPU
Jamie and Adam demonstrate the difference between a CPU and GPU.
7
This is not your advisor’s parallel computer!
Significant application-level speedup over uni-processor execution: no more “killer micros”
Easy entrance: an initial, naïve code typically gets at least a 2-3X speedup
8
This is not your advisor’s parallel computer!
Wide availability to end users: available on laptops, desktops, clusters, and supercomputers
Numerical precision and accuracy: IEEE floating-point and double precision
9
Historic GPGPU Constraints
[Figure: fragment-program model with Input Registers, Fragment Program, Output Registers, Constants, Texture, Temp Registers, and FB Memory; storage is marked as per thread, per shader, or per context]
Dealing with the graphics API: working with the corner cases of the graphics API
Addressing modes: limited texture size/dimension
Shader capabilities: limited outputs
Instruction sets: lack of integer and bit operations
Communication limited: no interaction between pixels
No scatter store ability: a[i] = p
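The scatter-store constraint listed above can be illustrated in plain C. This is a minimal sketch, not code from the slides: the function names `gather` and `scatter` and the index array `idx` are hypothetical. The assumption is that a fragment program could read from a computed address (a texture fetch) but could only write to its own fixed output location.

```c
#include <stddef.h>

/* Gather: output element i READS from a computed address idx[i].
   Assumed to be expressible in a fragment program as a texture fetch. */
void gather(float *out, const float *in, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[idx[i]];
}

/* Scatter: element i WRITES to a computed address idx[i].
   This is the a[i] = p pattern the slide lists as unavailable. */
void scatter(float *out, const float *in, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[idx[i]] = in[i];
}
```

The two loops differ only in which side of the assignment uses the computed index, which is why the missing scatter was felt as a restriction on otherwise ordinary array code.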
10
CUDA: no more shader functions
A CUDA application is an integrated CPU+GPU C program
Serial or modestly parallel C code executes on the CPU
Highly parallel SPMD kernel C code executes on the GPU
[Figure: execution alternates between the CPU and the GPU]
CPU Serial Code
GPU Parallel Kernel (Grid 0): KernelA<<< nBlk, nTid >>>(args);
CPU Serial Code
GPU Parallel Kernel (Grid 1): KernelB<<< nBlk, nTid >>>(args);
11
CUDA for Multi-Core CPU
A single GPU thread is too small for a CPU thread: CUDA emulation does this and performs poorly
CPU cores are designed for ILP and SIMD: optimizing compilers work well with iterative loops
Turn GPU thread blocks from CUDA into iterative CPU loops
[Figure: a compiler maps a CUDA grid to either the GPU or the CPU]
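The idea of turning thread blocks into iterative CPU loops can be sketched in plain C. This is an illustration under stated assumptions, not the compiler's actual output: the `ThreadCtx` struct, `vec_add_thread`, and `launch_vec_add` are hypothetical names, and the kernel is a simple vector add with no barriers or shared memory.

```c
/* Hypothetical stand-ins for CUDA's blockIdx.x, blockDim.x and threadIdx.x. */
typedef struct {
    int blockIdx;
    int blockDim;
    int threadIdx;
} ThreadCtx;

/* Per-thread body of a vector-add kernel, written the way a CUDA kernel is:
   each logical thread handles one element. */
static void vec_add_thread(ThreadCtx t, const float *a, const float *b,
                           float *c, int n)
{
    int i = t.blockIdx * t.blockDim + t.threadIdx;
    if (i < n)            /* guard: the last block may be partly empty */
        c[i] = a[i] + b[i];
}

/* CPU stand-in for a launch of the form Kernel<<< nBlk, nTid >>>(args). */
void launch_vec_add(int nBlk, int nTid, const float *a, const float *b,
                    float *c, int n)
{
    for (int blk = 0; blk < nBlk; ++blk) {        /* grid  -> loop over blocks  */
        for (int tid = 0; tid < nTid; ++tid) {    /* block -> loop over threads */
            ThreadCtx t = { blk, nTid, tid };
            vec_add_thread(t, a, b, c, n);
        }
    }
}
```

The inner loop is the kind of iterative code an optimizing CPU compiler handles well. The outer loop iterations are independent here, so they could in principle be spread across cores. A kernel that synchronizes its threads would need more than this simple loop nest.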
12
CUDA for Multi-Core CPU
Application    C on single-core CPU    CUDA on 4-core CPU    CUDA on 4-core CPU    CUDA on G80
               Time                    Time                  Speedup*              Time
MRI-FHD        ~1000 s                 230 s                 ~4x                   8.5 s
CP             180 s                   45 s                  4x                    0.28 s
SAD            42.5 ms                 25.6 ms               1.66x                 4.75 ms
MM (4Kx4K)     7.84 s**                15.5 s                3.69x                 1.12 s
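For the CP and SAD rows, the Speedup* column matches the ratio of the single-core time to the 4-core time. The sketch below shows that arithmetic. The helper name `speedup` is hypothetical, and the MM row is left out because its time carries a footnote (**) whose text is not part of this transcript.

```c
/* Speedup of a new run relative to a baseline run: t_base / t_new.
   Both times must be in the same unit. */
double speedup(double t_base, double t_new)
{
    return t_base / t_new;
}

/* CP on the 4-core CPU: 180 s single-core versus 45 s. */
double cp_speedup_4core(void)
{
    return speedup(180.0, 45.0);
}

/* SAD on the 4-core CPU: 42.5 ms single-core versus 25.6 ms. */
double sad_speedup_4core(void)
{
    return speedup(42.5, 25.6);
}
```

The same helper applied to the G80 column, for example `speedup(180.0, 0.28)` for CP, gives the ratio of single-core CPU time to GPU time.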