EECS 583 – Class 21 Research Topic 3: Compilation for GPUs University of Michigan December 12, 2011 – Last Class!!


EECS 583 – Class 21 Research Topic 3: Compilation for GPUs

University of Michigan

December 12, 2011 – Last Class!!


Announcements & Reading Material

This class reading
» “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.

Project demos
» Dec 13-16, 19 (19th is full)
» Send me email with date and a few timeslots if your group does not have a slot
» Almost all have signed up already


Project Demos

Demo format
» Each group gets 30 mins
    Strict deadlines because many back to back groups
    Don’t be late!
» Plan for 20 mins of presentation (no more!), 10 mins questions
    Some slides are helpful, try to have all group members say something
    Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results
    Demo or real code examples are good

Report
» 5 pg double spaced including figures – same content as presentation
» Due either when you do your demo or Dec 19 at 6pm


Midterm Results

Mean: 97.9 StdDev: 13.9 High: 128 Low: 50

If you did poorly, all is not lost. This is a grad class; the project is by far the most important!!

Answer key on the course webpage, pick up graded exams from Daya


Why GPUs?


Efficiency of GPUs

High Memory Bandwidth
» GTX 285: 159 GB/sec
» i7: 32 GB/sec

High Flop Rate
» Single precision: GTX 285: 1062 GFLOPS, i7: 102 GFLOPS
» Double precision: GTX 480: 168 GFLOPS, GTX 285: 88.5 GFLOPS, i7: 51 GFLOPS

High Flop Per Watt
» GTX 285: 5.2 GFLOP/W
» i7: 0.78 GFLOP/W

High Flop Per Dollar
» GTX 285: 3.54 GFLOP/$
» i7: 0.36 GFLOP/$


GPU Architecture

[Figure: GPU architecture. Streaming multiprocessors SM 0, SM 1, SM 2, ..., SM 29, each with cores 0-7, registers (Regs), and shared memory (Shared), connect through an interconnection network to global memory (device memory). A PCIe bridge links the device to the CPU and host memory.]


CUDA

“Compute Unified Device Architecture”

General purpose programming model
» User kicks off batches of threads on the GPU

Advantages of CUDA
» Interface designed for compute: graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
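As a rough illustration of three of these points (a batch of threads kicked off from the host, explicit GPU memory management, and integer and bitwise operations), here is a minimal sketch. The names `extractByte`, `extractByteOnGpu`, and `check` are hypothetical and not from the lecture; only the CUDA runtime calls are real API.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical kernel: each thread extracts one byte from one 32-bit word.
__global__ void extractByte(const uint32_t* in, uint32_t* out, int n, int byteIdx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        out[i] = (in[i] >> (8 * byteIdx)) & 0xFFu;  // integer shift and bitwise AND
    }
}

// Hypothetical host wrapper: allocate, copy in, launch, copy out, free.
std::vector<uint32_t> extractByteOnGpu(const std::vector<uint32_t>& host, int byteIdx) {
    assert(byteIdx >= 0 && byteIdx < 4);
    int n = static_cast<int>(host.size());
    std::vector<uint32_t> result(n);
    if (n == 0) return result;
    size_t bytes = n * sizeof(uint32_t);

    // Explicit GPU memory management: device buffers are separate from host memory.
    uint32_t* dIn = nullptr;
    uint32_t* dOut = nullptr;
    check(cudaMalloc(&dIn, bytes));
    check(cudaMalloc(&dOut, bytes));
    check(cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice));

    // Kick off a batch of threads: enough blocks of 256 threads to cover n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    extractByte<<<blocks, threads>>>(dIn, dOut, n, byteIdx);
    check(cudaGetLastError());

    // This copy waits for the kernel to finish before reading the results.
    check(cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(dIn));
    check(cudaFree(dOut));
    return result;
}
```

Calling `extractByteOnGpu({0x11223344u, 0xAABBCCDDu}, 1)` would select the second-lowest byte of each word. The block size of 256 is an arbitrary choice for the sketch, not a tuned value.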


Programming Model

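[Figure: programming model. Along a time axis, the host launches Kernel 1 and then Kernel 2; on the device, each kernel runs as a grid (Grid 1, Grid 2).]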
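The host/device picture on this slide (the host launches Kernel 1 and Kernel 2, and each runs on the device as its own grid) can be sketched as follows. The kernel bodies and the wrapper `runTwoKernels` are hypothetical placeholders; the sketch assumes both launches go to the default stream, where the second kernel starts only after the first has finished.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Kernel 1 (placeholder work): every thread writes its own global index.
__global__ void kernel1(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Kernel 2 (placeholder work): every thread updates the value left by kernel 1.
__global__ void kernel2(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2 * data[i] + 1;
}

// Hypothetical host routine: launches two kernels back to back.
std::vector<int> runTwoKernels(int n) {
    assert(n > 0);
    std::vector<int> result(n);
    int* d = nullptr;
    check(cudaMalloc(&d, n * sizeof(int)));

    dim3 block(128);                          // threads per block
    dim3 grid((n + block.x - 1) / block.x);   // blocks per grid

    kernel1<<<grid, block>>>(d, n);  // Grid 1: launch returns to the host immediately
    kernel2<<<grid, block>>>(d, n);  // Grid 2: same stream, so it runs after Grid 1
    check(cudaGetLastError());

    // The copy waits for both grids to complete.
    check(cudaMemcpy(result.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost));
    check(cudaFree(d));
    return result;
}
```

With this setup, element `i` of the result holds `2 * i + 1`, which only comes out right if Grid 2 really ran after Grid 1.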


GPU Scheduling

[Figure: GPU scheduling. The thread blocks of Grid 1 are distributed across the SMs (SM 0, SM 1, SM 2, SM 3, ..., SM 30), each with its own shared memory (Shared), registers (Regs), and cores 0-7.]
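The slide shows the blocks of a grid being spread over the SMs but does not state an assignment policy. One way to observe the placement on a given GPU, offered only as a sketch, is to have each block record the SM it ran on by reading the PTX special register `%smid`. The names `Placement`, `recordPlacement`, and `observePlacement` are hypothetical. PTX documents `%smid` as a volatile value below `%nsmid`, so the result is a snapshot and not a guarantee about scheduling.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

struct Placement {
    std::vector<unsigned int> smOfBlock;  // SM id observed by each block
    unsigned int smIdLimit;               // %nsmid: exclusive upper bound on SM ids
};

// Read the id of the SM executing the calling thread.
__device__ unsigned int readSmId() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Read the upper bound on SM ids (may exceed the physical SM count).
__device__ unsigned int readSmIdLimit() {
    unsigned int n;
    asm volatile("mov.u32 %0, %%nsmid;" : "=r"(n));
    return n;
}

// Thread 0 of every block records where that block was scheduled.
__global__ void recordPlacement(unsigned int* smOfBlock, unsigned int* smIdLimit) {
    if (threadIdx.x == 0) {
        smOfBlock[blockIdx.x] = readSmId();
        if (blockIdx.x == 0) *smIdLimit = readSmIdLimit();
    }
}

// Hypothetical host routine: launch a grid and report block-to-SM placement.
Placement observePlacement(int numBlocks, int threadsPerBlock) {
    assert(numBlocks > 0 && threadsPerBlock > 0);
    unsigned int* d = nullptr;
    size_t bytes = (numBlocks + 1) * sizeof(unsigned int);  // last slot holds the limit
    check(cudaMalloc(&d, bytes));

    recordPlacement<<<numBlocks, threadsPerBlock>>>(d, d + numBlocks);
    check(cudaGetLastError());

    std::vector<unsigned int> host(numBlocks + 1);
    check(cudaMemcpy(host.data(), d, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d));

    Placement p;
    p.smOfBlock.assign(host.begin(), host.begin() + numBlocks);
    p.smIdLimit = host[numBlocks];
    return p;
}
```

Printing `observePlacement(64, 128).smOfBlock` on a device with fewer than 64 SMs would typically show several blocks sharing an SM, which is the situation the scheduling figure depicts.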


Warp Generation

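[Figure: warp generation. Blocks 0, 1, 2, and 3 are assigned to SM0 (shared memory, registers, cores 0-7). Within a block, threads are grouped into warps by thread ID: IDs 0-31 form Warp 0 and IDs 32-63 form Warp 1.]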
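The thread-ID-to-warp mapping on this slide (IDs 0-31 in Warp 0, IDs 32-63 in Warp 1) amounts to an integer division by the warp size. A minimal sketch for a one-dimensional block follows, assuming the built-in `warpSize` is 32 as on the hardware discussed here. The names `recordWarp` and `warpsOfBlock` are hypothetical.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Each thread records (warp index within the block, lane index within the warp).
__global__ void recordWarp(int2* out) {
    int tid = threadIdx.x;                 // thread ID within a 1D block
    out[tid] = make_int2(tid / warpSize,   // .x: warp index
                         tid % warpSize);  // .y: lane index
}

// Hypothetical host routine: run a single block and return the mapping.
std::vector<int2> warpsOfBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0 && threadsPerBlock <= 1024);
    int2* d = nullptr;
    size_t bytes = threadsPerBlock * sizeof(int2);
    check(cudaMalloc(&d, bytes));

    recordWarp<<<1, threadsPerBlock>>>(d);
    check(cudaGetLastError());

    std::vector<int2> host(threadsPerBlock);
    check(cudaMemcpy(host.data(), d, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d));
    return host;
}
```

For a 64-thread block, `warpsOfBlock(64)` reproduces the split in the figure: entries 0 through 31 report warp 0 and entries 32 through 63 report warp 1.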


Memory Hierarchy

Per thread (Thread 0)
» Register: int RegisterVar
» Local memory: int LocalVarArray[10]

Per block (Block 0)
» Shared memory: __shared__ int SharedVar

Per app (Grid 0, device-wide, visible to the host)
» Global memory: __device__ int GlobalVar
» Constant memory: __constant__ int ConstVar
» Texture memory: texture<float,1,ReadMode> TextureVar
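The declarations on this slide can be combined into one small kernel that touches most of the memory spaces. This is a sketch under a few assumptions: the variable names follow the slide, the kernel `touchMemorySpaces` and the wrapper `runMemorySpaces` are hypothetical, texture memory is left out, and a variable in global memory is declared with `__device__` (`__global__` marks kernel functions). Whether `LocalVarArray` actually lands in local memory is up to the compiler; indexing it with a run-time value makes that likely.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

__device__ int GlobalVar;     // per-application global memory
__constant__ int ConstVar;    // per-application constant memory (read-only in kernels)

__global__ void touchMemorySpaces(int* out) {
    __shared__ int SharedVar;         // one copy per block
    int RegisterVar = threadIdx.x;    // per-thread scalar, normally kept in a register
    int LocalVarArray[10];            // per-thread array, may be placed in local memory
    for (int i = 0; i < 10; ++i) LocalVarArray[i] = RegisterVar + i;

    // One thread per block fills the shared variable; the barrier publishes it.
    if (threadIdx.x == 0) SharedVar = ConstVar + GlobalVar;
    __syncthreads();

    out[blockIdx.x * blockDim.x + threadIdx.x] =
        SharedVar + LocalVarArray[RegisterVar % 10];
}

// Hypothetical host routine: set the device-wide variables, run 2 blocks of 16 threads.
std::vector<int> runMemorySpaces(int constVal, int globalVal) {
    const int blocks = 2, threads = 16, n = blocks * threads;
    check(cudaMemcpyToSymbol(ConstVar, &constVal, sizeof(int)));
    check(cudaMemcpyToSymbol(GlobalVar, &globalVal, sizeof(int)));

    int* d = nullptr;
    check(cudaMalloc(&d, n * sizeof(int)));
    touchMemorySpaces<<<blocks, threads>>>(d);
    check(cudaGetLastError());

    std::vector<int> host(n);
    check(cudaMemcpy(host.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost));
    check(cudaFree(d));
    return host;
}
```

Each output element equals `constVal + globalVal + t + (t % 10)` for thread index `t` within its block, so both blocks produce the same 16 values.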


Discussion Points

Who has written CUDA, how have you optimized it, how long did it take?
» Did you tune using a better algorithm than trial and error?

Is there any hope to build a GPU compiler that can automatically do what CUDA programmers do?
» How would you do it?
» What’s the input language? C, C++, Java, StreamIt?

Are GPUs a compiler writer’s best friend or worst enemy?

What about non-scientific codes, can they be mapped to GPUs?
» How can GPUs be made more “general”?