Page 1: CUDA GPU Computing

CUDA GPU Computing

Advisor: Cho-Chin Lin

Student: Chien-Chen Lai

Page 2: CUDA GPU Computing

Outline

Introduction and Motivation

Page 3: CUDA GPU Computing

What is driving the many-cores?

[Figure: peak GFLOPS (vertical axis, 0 to 600) from Jan 2003 to Jul 2007. GPUs plotted: NV30, NV35, NV40, G70, G70-512, G71, GeForce 8800 GTX, Quadro FX 5600, Tesla C870. CPUs plotted: 3.0 GHz Pentium 4, 3.0 GHz Core 2 Duo, 3.0 GHz Core 2 Quad.]

Page 4: CUDA GPU Computing

Design philosophies are different.

[Figure: block diagrams of a CPU and a GPU side by side. Labeled blocks: Control, ALU, Cache, DRAM.]

The GPU is specialized for compute-intensive, massively data-parallel computation (exactly what graphics rendering is about).

So, more transistors can be devoted to data processing rather than to data caching and flow control.

Page 5: CUDA GPU Computing

Page 6: CUDA GPU Computing

CPU vs. GPU

Jamie and Adam demonstrate the difference between a CPU and GPU.

Page 7: CUDA GPU Computing

This is not your advisor’s parallel computer!

Significant application-level speedup over uni-processor execution: no more “killer micros”.

Easy entrance: an initial, naïve code typically gets at least a 2-3X speedup.

Page 8: CUDA GPU Computing

This is not your advisor’s parallel computer!

Wide availability to end users: available on laptops, desktops, clusters, and supercomputers.

Numerical precision and accuracy: IEEE floating-point and double precision.

Page 9: CUDA GPU Computing

Historic GPGPU Constraints

[Figure: the fragment program model. Labeled blocks: Input Registers, Fragment Program, Output Registers, FB Memory, Constants, Texture, Temp Registers. Legend: per thread, per shader, per context.]

Dealing with graphics API: working with the corner cases of the graphics API.

Addressing modes: limited texture size/dimension.

Shader capabilities: limited outputs.

Instruction sets: lack of integer and bit ops.

Communication limited: no interaction between pixels; no scatter store ability (a[i] = p).

Page 10: CUDA GPU Computing

CUDA - No more shader functions.

CUDA integrated CPU+GPU application C program: serial or modestly parallel C code executes on the CPU; highly parallel SPMD kernel C code executes on the GPU.

[Figure: execution alternates between the CPU and the GPU]

CPU Serial Code

GPU Parallel Kernel (Grid 0): KernelA<<< nBlk, nTid >>>(args);

CPU Serial Code

GPU Parallel Kernel (Grid 1): KernelB<<< nBlk, nTid >>>(args);

Page 11: CUDA GPU Computing

CUDA for Multi-Core CPU

A single GPU thread is too small for a CPU thread: CUDA emulation does this and performs poorly.

CPU cores are designed for ILP and SIMD: optimizing compilers work well with iterative loops.

Turn GPU thread blocks from CUDA into iterative CPU loops.

[Figure: a compiler maps a CUDA grid to either the GPU or the CPU.]

Page 12: CUDA GPU Computing

CUDA for Multi-Core CPU

Application | C on single-core CPU (time) | CUDA on 4-core CPU (time) | Speedup* | CUDA on G80 (time)
MRI-FHD     | ~1000s                      | 230s                      | ~4x      | 8.5s
CP          | 180s                        | 45s                       | 4x       | 0.28s
SAD         | 42.5ms                      | 25.6ms                    | 1.66x    | 4.75ms
MM (4Kx4K)  | 7.84s**                     | 15.5s                     | 3.69x    | 1.12s