eecs 583 – class 21 research topic 3: compilation for gpus university of michigan december 12,...

EECS 583 – Class 21Research Topic 3: Compilation for GPUs

University of Michigan

December 12, 2011 – Last Class!!

- 2 -

Announcements & Reading Material

This class reading» “Program optimization space pruning for a multithreaded GPU,”

S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Straton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.

Project demos» Dec 13-16, 19 (19th is full)

» Send me email with date and a few timeslots if your group does not have a slot

» Almost all have signed up already

- 3 -

Project Demos

Demo format» Each group gets 30 mins

Strict deadlines because many back to back groups Don’t be late!

» Plan for 20 mins of presentation (no more!), 10 mins questions Some slides are helpful, try to have all group members say

something Talk about what you did (basic idea, previous work), how you did it

(approach + implementation), and results Demo or real code examples are good

Report» 5 pg double spaced including figures – same content as

presentation

» Due either when you do you demo or Dec 19 at 6pm

- 4 -

Midterm Results

Mean: 97.9 StdDev: 13.9 High: 128 Low: 50

If you did poorly, all is not lost. This is a grad class, the project is by far most important!!

Answer key on the course webpage, pick up graded exams from Daya

- 5 -

Why GPUs?

5

- 6 -

Efficiency of GPUs

6

High Memory

Bandwidth

GTX 285 : 159 GB/Seci7 : 32 GB/Sec

High FlopRate

i7 :102 GFLOPS

GTX 285 :1062 GFLOPS

i7 : 51 GFLOPS

GTX 285 : 88.5 GFLOPS

GTX 480 : 168 GFLOPS

High FlopPer Watt

GTX 285 : 5.2 GFLOP/W

i7 : 0.78 GFLOP/W

High FlopPer Dollar

GTX 285 : 3.54 GFLOP/$i7 : 0.36 GFLOP/$

- 7 -

GPU Architecture

7

Shared

Regs

0 1

2 3

4 5

6 7

Interconnection Network

Global Memory (Device Memory)PCIe

Bridge

CPU HostMemory

Shared

Regs

0 1

2 3

4 5

6 7

Shared

Regs

0 1

2 3

4 5

6 7

Shared

Regs

0 1

2 3

4 5

6 7

SM 0 SM 1 SM 2 SM 29

- 8 -

CUDA

“Compute Unified Device Architecture”

General purpose programming model» User kicks off batches of threads on the GPU

Advantages of CUDA» Interface designed for compute - graphics free API

» Orchestration of on chip cores

» Explicit GPU memory management

» Full support for Integer and bitwise operations

8

- 9 -

Programming Model

9

Host

Kernel 1

Kernel 2

Grid 1

Grid 2

DeviceTi

me

- 10 -

Grid 1

GPU Scheduling

10

SM 0

Shared

Regs

0 1

2 3

4 5

6 7

SM 1

Shared

Regs

0 1

2 3

4 5

6 7

SM 2

Shared

Regs

0 1

2 3

4 5

6 7

SM 3

Shared

Regs

0 1

2 3

4 5

6 7

SM 30

Shared

Regs

0 1

2 3

4 5

6 7

- 11 -

Warp Generation

Block 0

Block 1

Block 3

Shared

Registers

0 1

2

4 5

3

6 7

SM0

11

Block 2

ThreadId

0 31 32 63Warp 0 Warp 1

- 12 -

Memory Hierarchy

12

Per BlockShared Memory

__shared__ int SharedVar

Block 0

Per-threadRegister

int LocalVarArray[10]

Per-threadLocal Memory

int RegisterVarThread 0

Grid 0

Per appGlobal Memory

Host

__global__ int GlobalVar

__constant__ int ConstVar

Texture<float,1,ReadMode> TextureVar

Per appTexture Memory

Per appConstant Memory

Devic

e

- 13 -

Discussion Points

Who has written CUDA, how have you optimized it, how long did it take?» Did you do tune using a better algorithm than trial and error?

Is there any hope to build a GPU compiler that can automatically do what CUDA programmers do?» How would you do it?

» What’s the input language? C, C++, Java, StreamIt?

Are GPUs a compiler writers best friend or worst enemy?

What about non-scientific codes, can they be mapped to GPUs?» How can GPUs be made more “general”?

eecs 583 – class 21 research topic 3: compilation for gpus university of michigan december 12,...

Documents

gpu compiler

gpu architecture

multithreaded gpu

gpu scheduling

efficiency of gpus

grad class

memory hierarchy

group members