

Real-Time Graphics Architecture

Lecture 9: Programming GPUs

Kurt Akeley
Pat Hanrahan
http://graphics.stanford.edu/cs448-07-spring/

Programming GPUs

Outline

History

Contemporary GPUs
  Programming model
  Implementation

GPGPU: non-graphical computation using GPUs

Caveat
  This is not a programming tutorial
  You will gain some understanding of the architectural issues that underlie programmable GPUs


GPU programmability is not new

Essentially all GPUs are (and have been) programmable

The questions are:

Who gets to program them?

And when?

Example systems from SGI’s early history:

Geometry Engine (1981)

GE4 (1987)


VGXT Image Engine (1991)

Dynamic code generation

Geometry engine (1981)

Four 32-bit floating-point units

Jim-Clark format

2's-complement exponent and mantissa

Programmed using John Hennessy's SLIM (Stanford Language for the Implementation of Microcode)

~500-term PLA, hard-wired at fabrication time


Programmed by Jim Clark and Marc Hannah


SGI GE4 (1987)

Weitek 3332 floating-point ALU

~4k ~64-bit lines of microcode in SRAM

Programmed in low-level assembly code

No run-time code changes were made (though they could have been)

Programmed by several SGI employees, including Kurt


[Photo: wire-wrapped GE4 prototype]

VGXT image engine (1991)

Tiny code space to implement:
  Texture filter and application
  Frame buffer operations

Code overlays loaded on-the-fly

Code sequences generated prior to run-time

Fields (e.g., blend function) filled in at run time



Dynamic code generation

SUN graphics card

Bresenham lines

Abrash talk …


Application programmability is not new

GPUs have a long history of application programmability

Example systems:

Ikonas graphics system (1978)

Trancept TAAC-1 (1987)

Multi-pass vector processing (2000)

Register combiners (2001)



Ikonas graphics system (1978)

Multi-board system

32-bit integer processor (16x16 multiply)

16-bit graphics processor (16-pixel parallel render)

Hardware Z-buffer


Image credit: www.virhistory.com/ikonas/ikonas.html

Trancept TAAC-1 (1987)

C-language microcode compiler

200-bit microcode word

Sold to Sun Microsystems in 1987

"Customers got a C language microcode compiler and were strongly encouraged (!) to develop their own code" –Nick England, www.virhistory.com/trancept/trancept.html

Image credit: www.virhistory.com/trancept/trancept.html


Multi-pass vector processing (2000)

Treat OpenGL as a very long instruction word
Compute vector style: apply each instruction to all pixels
Build up the final image in many passes

#include "marble.h"
surface marble() {
  varying color a;
  uniform string tx;
  uniform float x;
  x = 0.5;
  tx = "noisebw.tx";
  FB = texture(tx, scale(x,x,x));
  repeat(3) {
    x = x * 0.5;
    FB *= 0.5;
    FB += texture(tx, scale(x,x,x));
  }
  FB = lookup(FB, tab);
  a = FB;
  FB = diffuse;
  FB *= a;
  FB += environment("env");
}

Peercy, Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000
(Figure adapted from the SIGGRAPH paper)
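To make the compilation strategy concrete, here is a minimal sketch (not the paper's actual code generator) of how two of the "instructions" above map onto ordinary OpenGL 1.x blending passes; draw_fullscreen_quad() and tex_noise are assumed helpers, not part of any API.

#include <GL/gl.h>

/* Hypothetical helpers: draw_fullscreen_quad() issues a screen-aligned
 * quad; tex_noise is a texture object already loaded with the noise texture. */
extern void draw_fullscreen_quad(void);
extern GLuint tex_noise;

/* One rendering pass per "instruction", accumulating the result in the framebuffer. */
static void pass_fb_scale_half(void)
{
    /* FB *= 0.5  ->  dst = dst * src, with src a constant gray 0.5 */
    glDisable(GL_TEXTURE_2D);
    glEnable(GL_BLEND);
    glBlendFunc(GL_ZERO, GL_SRC_COLOR);
    glColor3f(0.5f, 0.5f, 0.5f);
    draw_fullscreen_quad();
}

static void pass_fb_add_texture(void)
{
    /* FB += texture(tx, ...)  ->  dst = dst + src, with src a textured quad */
    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, tex_noise);
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);
    glColor3f(1.0f, 1.0f, 1.0f);
    draw_fullscreen_quad();
}

Every such pass reads and writes the full framebuffer, which is the bandwidth-heavy behavior criticized in the history summary below.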

Register combiners (2001)

ARB_texture_env_combine
"New texture environment function COMBINE_ARB allows texture combiner operations, including REPLACE, MODULATE, ADD, …, INTERPOLATE."

ARB_texture_env_crossbar
"Adds the capability to use the texture color from other texture units as sources to the COMBINE_ARB environment function."

ARB_texture_env_dot3

Adds a dot-product operation to the texture combiner operations.
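As a concrete illustration, here is a sketch (assuming an OpenGL context that exposes ARB_texture_env_combine and ARB_texture_env_dot3) of configuring one texture unit as a DOT3 combiner in C; the enum names come from those extension specifications.

#include <GL/gl.h>
#include <GL/glext.h>

/* Configure texture environment 0 so that RGB = dot3(texture color, primary color),
 * the usual per-pixel bump-mapping term. */
void setup_dot3_combiner(void)
{
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_ARB,  GL_DOT3_RGB_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_ARB,  GL_TEXTURE);           /* e.g. a normal map  */
    glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND0_RGB_ARB, GL_SRC_COLOR);
    glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE1_RGB_ARB,  GL_PRIMARY_COLOR_ARB); /* e.g. a light vector */
    glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND1_RGB_ARB, GL_SRC_COLOR);
}

Each added operation meant more fixed-function state of this kind per texture unit, which is the "hopelessly increasing complexity" noted in the summary that follows.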


Summary of history

General-purpose processor with attached units
  Could not compete in performance/cost ratio
  "Heathkit" marketing

Multi-pass vector processing
  Low performance
  Wasteful use of bandwidth (against the trends)

Register combiners
  Rube Goldberg design quality
  Hopelessly increasing complexity

Programmable pipeline stages
  Not accessible to application programmers
  Reason (at SGI) was:
    Unable to keep the programming model consistent from product to product, and
    Unable to use software to hide the inconsistencies

Contemporary GPU Programming Model

Application-programmable pipeline stages



Programming model (Direct3D 10 pipeline)

[Diagram: the Direct3D 10 pipeline: Input Assembler, Vertex Shader, Geometry Shader, Rasterizer, Pixel Shader, Output Merger; vertexes enter at the Input Assembler; each shader stage has Sampler, Constant, and Texture resources; all stages connect to a single Memory; the Output Merger writes the render Target; the three shader stages are marked Application programmable]

Traditional graphics pipeline (except location of geometry shader)

Three application-programmable stages operate on independent vertexes, primitives, and pixel fragments

All stages share the same programming model, and (most) resources

Single shared memory, but memory objects remain distinct

Direct3D 10 programmable core

[Diagram: the programmable core. Input registers (16, 4x32b) arrive from the previous stage; Temp registers (32/4096, 4x32b); an FP unit, an Int ALU, and a Special Function unit; Textures (128), Samplers (16), and Const buffers (16, 4096x32b); Output registers (16, 4x32b) go to the next stage]

Complete instruction set
  Floating-point operations
  Integer operations (new)
  Branching operations

Unlimited code length

Vector instructions

Lots of temporary state

Independent operation on each vertex, primitive, or pixel fragment


Vector instructions

Vector vs. scalar

4-component vector arithmetic is typical of gfx

But as shaders get longer, an increasing fraction of their code is scalar

Some GPUs support dual-issue split vectors:
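A rough illustration of the idea, written as plain C rather than any particular GPU's instruction set: an independent 3-wide vector operation and a scalar operation can be packed into the same 4-wide issue slot.

/* Illustration only: a 3+1 dual-issue GPU can execute the float3 multiply
 * (3 lanes) and the independent scalar operation (1 lane) in the same
 * instruction slot, because together they fill the 4-wide ALU. */
typedef struct { float x, y, z; } float3;

void shade_step(float3 n, float3 l, float dist, float3 *out_nl, float *out_atten)
{
    /* vector "pipe": component-wise product (3 lanes) */
    out_nl->x = n.x * l.x;
    out_nl->y = n.y * l.y;
    out_nl->z = n.z * l.z;

    /* scalar "pipe": independent scalar work (1 lane), co-issued */
    *out_atten = 1.0f / (dist * dist);
}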


Programming interface

Assembly code
  Low level and error prone
  Not available in Direct3D 10

C-like shading language
  Cg: layered above assembly language
  HLSL, OpenGL Shading Language: intrinsic

Fixed-function stages
  Lost when an application-defined shader is used
  Not included in Direct3D 10
  Were to be omitted from OpenGL 2.0

Vertex transformation (ARB_vertex_program):

PARAM mat[4] = { state.matrix.modelview };
DP4 result.x, vertex, mat[0];
DP4 result.y, vertex, mat[1];
DP4 result.z, vertex, mat[2];
DP4 result.w, vertex, mat[3];

Vertex transformation (Cg):

void transform(float4 position : POSITION,
               out float4 oPosition : POSITION,
               uniform float4x4 modelView)
{
    oPosition = mul(modelView, position);
}


Parallel coding complexity is minimized

Complexities are handled by the implementation:

Dependencies
  Ordering
  Sorting
Scalability
  Computation
  Bandwidth
Load balancing (much more implementation work)

All threads are independent. Message passing is:
  Not required
  Not allowed

Parallel coding complexity is minimized

Correctness constraints include:

Cannot modify textures directly from shaders

Cannot bind a target as both source and sink (see the ping-pong sketch below)
  Texture and render target
  Vertex buffer and stream output

Cannot “union” different data formats
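A minimal sketch of the standard workaround for the texture / render-target restriction, assuming the EXT_framebuffer_object extension; run_shader_pass() is a hypothetical helper, not a real API call.

#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>

/* Hypothetical helper: binds `src_texture` for reading and draws a
 * full-screen quad with the current shader into the bound render target. */
extern void run_shader_pass(GLuint src_texture);

/* Because a texture cannot be bound as both source and render target,
 * iterative computations alternate ("ping-pong") between two textures. */
void iterate(GLuint fbo, GLuint tex[2], int passes)
{
    int src = 0, dst = 1;
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    for (int i = 0; i < passes; ++i) {
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, tex[dst], 0);
        run_shader_pass(tex[src]);      /* read tex[src], write tex[dst] */
        src ^= 1; dst ^= 1;             /* swap roles for the next pass  */
    }
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);
}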



Performance considerations

Programmer can largely ignore:

Memory latency

Floating-point cost (same as integer)

Special function cost

Texture filtering cost

Programmer should pay attention to:

Divergent branching


Number of registers used

Size of storage formats (bandwidth)

Re-specification of Z values (defeats Z cull)

Contemporary GPU Implementation



GeForce 7800 parallelism

[Block diagram, figure courtesy of NVIDIA: Host / FW / VTF, Cull / Clip / Setup, Z-Cull, Shader Instruction Dispatch, L2 Tex, Fragment Crossbar, and four Memory Partitions, each with its own DRAM(s)]

SIMD control

SIMD = Single Instruction, Multiple Data

One or more SIMD blocks; each block includes many data paths controlled by a single sequencer

[Images: generic SIMD block (one Sequencer controlling several Data Paths); Geometry Engine]

SIMD benefits:
  Reduced circuitry
  Synchronization of events
  Reduced instruction-fetch bandwidth
  Better texture cache utilization

SIMD liabilities:
  Branch divergence reduces performance
  Branch convergence is complicated to implement
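A rough sketch in plain C (not any vendor's hardware) of why branch divergence reduces performance: when the lanes of one SIMD block disagree about a branch, the sequencer effectively runs both sides and a per-lane mask selects the results, so the block pays roughly the combined cost of the two paths.

#define LANES 8  /* width of one hypothetical SIMD block */

/* Both sides of the branch are evaluated by every lane and a per-lane mask
 * selects which result is kept; that is what divergent execution costs.
 * Real hardware can skip a side entirely when all lanes agree. */
void shade_block(const float in[LANES], float out[LANES])
{
    int take_then[LANES];
    float then_val[LANES], else_val[LANES];

    for (int i = 0; i < LANES; ++i)
        take_then[i] = (in[i] > 0.5f);           /* the "branch" condition */

    for (int i = 0; i < LANES; ++i)
        then_val[i] = in[i] * in[i];             /* then-side, all lanes   */

    for (int i = 0; i < LANES; ++i)
        else_val[i] = 1.0f - in[i];              /* else-side, all lanes   */

    for (int i = 0; i < LANES; ++i)              /* write-back under mask  */
        out[i] = take_then[i] ? then_val[i] : else_val[i];
}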


Multiple threads hide memory latency

Scheduler maintains a queue of ready-to-run threads

When a thread blocks, another is started

Blocking determination may be:

Simple (any attempted memory access), or

Advanced (block only on data dependencies)

Thread context storage is significant – product of

Number of processors,

Threads per processor, and

State per thread (registers, pc, …)

Fixed-size memory pool may be insufficient for large thread contexts

Large state per thread means too few threads per processor

Local registers may not be “free”
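A back-of-the-envelope sketch of that product; all of the counts below are illustrative assumptions, not the parameters of any shipping GPU.

#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions only */
    const int processors         = 16;    /* shader processors on the chip      */
    const int threads_per_proc   = 96;    /* resident threads needed to hide a
                                             few hundred cycles of latency      */
    const int regs_per_thread    = 16;    /* 4x32-bit registers per thread      */
    const int bytes_per_register = 16;    /* 4 x 32 bits                        */

    long context_bytes = (long)processors * threads_per_proc *
                         regs_per_thread * bytes_per_register;

    printf("thread-context storage: %ld KB\n", context_bytes / 1024);
    /* 16 * 96 * 16 * 16 B = 393216 B = 384 KB.  Doubling the registers per
       thread doubles it, or halves the threads that fit per processor. */
    return 0;
}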

Latency hiding consumes memory

Latency hiding mechanisms consume memory:
  Cache (CPU, ~texture)
  Fragment FIFO (texture pre-fetching)
  Shader context (threaded execution)

GPUs devote less die area to cache than CPUs

But the real question is: what area is devoted to latency-hiding memory?

Hint: ask the invited experts

[Diagram from lecture 7: texture prefetching: Rasterizer, fragment FIFO, Request FIFO, Texture memory, Reorder Buffer, Texture Filter, Texture Apply]


GeForce 8800 thread scheduling

Threads are scheduled in SIMD blocks

4 threads per quad (2x2 fragment unit)

16 threads per "warp" (NVIDIA GeForce 8800)

Thread scheduling is critical and complicated

David Kirk's assessment of 8800 load balancing: "3x more difficult overall"

Sort-last-fragment, with load balancing for:
  Vertex processors
  Geometry (primitive) processors
  Pixel fragment processors

Reference: Mike Shebanow, ECE 498 AL: Programming Massively Parallel Computers, Lecture 12

Thread scheduling complexity

Two parts to scheduling decisions:

What class of thread should be run?
  Vertex, primitive, fragment

When should a thread (group) be blocked?
  When a memory access is attempted?
  When a data dependency is reached?

Data-dependent blocking is worthwhile, because it conserves thread-context memory space

Has been an ATI advantage over NVIDIA

Did this contribute to 8800 scheduling difficulty?
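A back-of-the-envelope sketch of why data-dependent blocking conserves thread-context memory: the longer a thread can keep running after it issues a load, the fewer resident threads are needed to cover a given memory latency. The numbers below are illustrative assumptions.

#include <stdio.h>

/* Threads needed to keep one processor busy while a memory access of
 * `latency` cycles is outstanding, given how many useful cycles each
 * thread runs before it must block. */
static int threads_to_hide(int latency, int cycles_before_block)
{
    return 1 + (latency + cycles_before_block - 1) / cycles_before_block;
}

int main(void)
{
    const int latency = 200;  /* illustrative DRAM latency, in cycles */

    /* Simple policy: block as soon as the load is issued (~1 useful cycle). */
    printf("block on access:     %d threads\n", threads_to_hide(latency, 1));

    /* Advanced policy: block only when the loaded value is actually used,
       after, say, 20 independent cycles of ALU work. */
    printf("block on dependency: %d threads\n", threads_to_hide(latency, 20));
    return 0;
}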


Abstraction distance

Actual instruction set may not match the "advertised" one

Was true when assembly language was supported: complex post-assembly run-time optimization

Layered languages (like Cg): 2-layer optimization

Even easier with Direct3D 10

Example: use a scalar processor to emulate vector operations (a sketch follows below)

Large abstraction distance wasn't practical in the '90s

CPUs are much faster now

Software technology has come a long way
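For the scalar-emulation example above, a sketch of the kind of rewrite a driver's code generator might perform (illustrative only): the advertised 4-wide DP4 becomes a short sequence of scalar multiply-adds.

/* Advertised instruction:  DP4 dst, a, b   (one 4-wide dot product)
 * What a scalar core might actually run: four scalar multiply-adds. */
float dp4_scalar(const float a[4], const float b[4])
{
    float acc;
    acc  = a[0] * b[0];
    acc += a[1] * b[1];
    acc += a[2] * b[2];
    acc += a[3] * b[3];
    return acc;
}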

Why isn’t Output Merger programmable?

OM is highly optimized

Memory access is the GPU limiter in many cases

OM has data hazards

All programmable interfaces are memory read-only

[Diagram: the Direct3D 10 pipeline again (Input Assembler, Vertex Shader, Geometry Shader, Rasterizer, Pixel Shader, Output Merger, shared Memory, render Target), with the three shader stages marked Application programmable]


GPGPU


Performance motivates interest in GPGPU

7x performance advantage to GPUs today:

47 GFLOPS – Intel Core 2 Extreme X6800

345 GFLOPS – NVIDIA GeForce 8800 GTX

350 GFLOPS – ATI X1900



Rate-of-change of performance is the key

CPU and GPU rates differ:

GPU: ~2x CAGR

CPU: <1.5x CAGR, recently much less

Will this 10x/decade disparity persist in the multi-core era?

[Chart, NVIDIA Corporation 2006: GFLOPS over time for successive GPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

Additional GPGPU motivations

Visible success stories
  Folding@home gets 20-40x speedup over PCs
  Matrix multiply achieves 120 GFLOPS on ATI X1900

Ubiquity of GPU hardware
  Every PC has a GPU of some kind
  This is a first for an attached hardware accelerator

Evolving GPU architecture
  Rich instruction set
  Highly parallel
  Improving GPU-CPU data transfer rates


GPUs are high-performance multiprocessors

GeForce 8800 block diagram (NVIDIA Corporation 2006)

[Block diagram: Host, Data Assembler, Vtx / Geom / Pixel Thread Issue, Setup / Rstr / ZCull, Thread Processor, an array of streaming-processor (SP) pairs with texture filter (TF) units and L1 caches, and six L2 / FB (frame buffer) memory partitions]

Scatter, Gather, and Reduction

Frame buffer mindset: a fragment is associated with a specific region of memory

            Early programmable GPUs (2001)   Today's GPUs (2007)
Gather      Texture fetch                    Texture fetch
Scatter     --                               Uses fragment sort
Reduction   --                               Occlusion query
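As a plain-C reminder of what the three access patterns mean (illustrative definitions, not GPU code): gather reads from computed addresses, scatter writes to computed addresses, and a reduction combines many values into one.

/* n elements to process; idx[] holds computed addresses into in[]/out[]. */
void gather(float *out, const float *in, const int *idx, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[idx[i]];          /* read from a computed address */
}

void scatter(float *out, const float *in, const int *idx, int n)
{
    for (int i = 0; i < n; ++i)
        out[idx[i]] = in[i];          /* write to a computed address  */
}

float reduce_sum(const float *in, int n)
{
    float sum = 0.0f;                 /* combine many values into one */
    for (int i = 0; i < n; ++i)
        sum += in[i];
    return sum;
}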


GPGPU concerns

Application specificity. Requires:

High arithmetic intensity (compute / bandwidth)

High branching coherence

Ability to decompose to 1000s of threads

Computation quality

No memory error detection/correction

No double precision


Incomplete (but improving) IEEE single precision

Possible end of divergence of CPU and GPU performance

Abysmal history of attached processor success

How will CPUs and GPUs coexist?

Possible outcomes:

CPU instruction set additions

Heterogeneous multi-core
  Shared memory
  Differentiated memory
Generalized GPUs

What do/will we mean by GPU?

Anything graphics related?

Massively multi-threaded to hide memory latency?


Summary

Long history of GPU programmability, but current situation (app-programmable pipeline stages) is new

Current GPUs are easily programmed (for graphics) and give programmers amazing power

Current GPUs are high-performance multiprocessors

Exciting things are happening in computer architecture, and GPUs will play an important role


Suggested readings

Wen-Mei Hwu and David Kirk, ECE 498 AL: Programming Massively Parallel Computers, http://courses.ece.uiuc.edu/ece498/al/index.html

Peercy, Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000.

Fernando, GPU Gems, Addison Wesley, 2004.

Pharr and Fernando, GPU Gems 2, Addison Wesley, 2005.

