graphics processors norm rubin – compiler architect – [email protected]

Graphics processors

Norm Rubin – compiler architect –

[email protected]

Feb 15, 2005


Size of market

• Many millions of GPUs ship per month

• The 3D market is entertainment (games)

• Each new generation of GPU adds enough performance to support a new version of a game.

• Each time a game is released, players have to replace hardware to run the game.

• The game industry is larger than Hollywood.


Technology view

                          GPU            CPU

Performance / function    Not enough     Too good

Architecture              Proprietary    Commodity

Interfaces                Mutable        Locked down

(Performance / function scale: not enough, OK, too good.)


How much headroom

• Pixar uses 100,000 minutes of compute per minute of image

• GPUs are real time, so the gap is 100,000x, roughly 17 to 20 doublings

• Most optimistic marketing version of Moore’s law: performance doubles every 6 months

• So there are roughly 8 to 10 years to go.
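The headroom estimate is a base-2 logarithm. A minimal sketch of the arithmetic, assuming the 100,000x gap and the 6-month doubling period quoted on the slide; the function name is illustrative only.

```python
import math

def years_to_close_gap(gap, months_per_doubling):
    """Return (doublings, years) needed for performance to grow by `gap`."""
    doublings = math.log2(gap)
    years = doublings * months_per_doubling / 12
    return doublings, years

# 100,000x gap, one doubling every 6 months (the optimistic figure).
doublings, years = years_to_close_gap(100_000, 6)
print(f"{doublings:.1f} doublings, {years:.1f} years")
```

Rounding the gap up to 2^20 (about a million) gives the round figures of 20 doublings and 10 years.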


Application space

• Problems are embarrassingly parallel

• Problems are big: the screen is 1000 x 1000 and the program runs per pixel, including some pixels that are behind others, so 10 * 1000 * 1000 calls per frame * 20-60 frames per second

• The same program runs over and over, so GPUs are SIMD machines
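The workload size on this slide multiplies out as follows. This is only a sketch of the slide's own arithmetic; the factor of 10 for hidden pixels and the 20-60 fps range come from the slide, and the function name is made up for illustration.

```python
def shader_calls_per_second(width, height, overdraw, fps):
    """Pixel-program invocations per second for one screen."""
    return width * height * overdraw * fps

low = shader_calls_per_second(1000, 1000, 10, 20)
high = shader_calls_per_second(1000, 1000, 10, 60)
print(f"{low:,} to {high:,} calls per second")
```

That is hundreds of millions of invocations of the same small program every second.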


SIMD

• There are many units executing in parallel

– These are in lock-step, executing the same instruction on different pixels/vertices at the same time

– Dynamic flow control can cause inefficiencies in such an architecture, since different pixels/vertices can take different code paths

– Dynamic branching is not always a performance win

– For an if…then…else, the hardware needs to execute both sides, turning processors on and off.
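The cost of divergence can be modeled with a per-lane mask. The sketch below is a software model, not any vendor's hardware: it assumes a side of the branch is skipped only when no lane needs it, and `simd_if_else` is a hypothetical name.

```python
def simd_if_else(lanes, cond, then_op, else_op):
    """Model a lock-step if/else over a group of lanes.

    Each side of the branch runs as one pass over all lanes, with a mask
    switching lanes on and off. Returns (results, passes), where passes is
    1 when every lane agrees and 2 when the lanes diverge.
    """
    mask = [cond(x) for x in lanes]
    results = list(lanes)
    passes = 0
    if any(mask):        # "then" pass: lanes with the mask off sit idle
        passes += 1
        results = [then_op(x) if m else r
                   for x, m, r in zip(lanes, mask, results)]
    if not all(mask):    # "else" pass: lanes with the mask on sit idle
        passes += 1
        results = [r if m else else_op(x)
                   for x, m, r in zip(lanes, mask, results)]
    return results, passes

double = lambda x: x * 2
zero = lambda x: 0
positive = lambda x: x > 0

divergent, divergent_passes = simd_if_else([1, -2, 3, -4], positive, double, zero)
coherent, coherent_passes = simd_if_else([1, 2, 3, 4], positive, double, zero)
print(divergent, divergent_passes)
print(coherent, coherent_passes)
```

When the lanes diverge, the group pays for both sides even though each lane only needs one of them.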


Application space

• Many values are coherent: values in neighboring pixels are close.

• Compute coherent variables at selected points and use interpolation to find the intermediate values

• Today the programmer specifies which variables are coherent by splitting programs in two.
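One way to picture this split is to compute a value only at the three corners of a triangle and interpolate it for every pixel inside. The sketch below uses plain barycentric weights and ignores perspective correction; the brightness values are made up for illustration.

```python
def interpolate(v0, v1, v2, w0, w1, w2):
    """Barycentric interpolation of a per-vertex value inside a triangle.

    w0, w1, w2 are the pixel's barycentric weights and must sum to 1.
    """
    return w0 * v0 + w1 * v1 + w2 * v2

# Hypothetical brightness computed once per vertex by the first program.
brightness = (0.0, 1.0, 0.5)

center = interpolate(*brightness, 1 / 3, 1 / 3, 1 / 3)
at_v1 = interpolate(*brightness, 0.0, 1.0, 0.0)
print(center, at_v1)
```

The second program then receives the interpolated value per pixel instead of recomputing it.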


Application space

• A common subproblem is texture filtering

– Evaluate some array of memory around a stencil and combine

– Provide a small fixed set of stencil patterns in hardware

– You could think of this as slightly smart memory

– Hardware support for 1-D to 3-D arrays and several filtering functions

– Exact stencil patterns and combining operations are proprietary (some look better than others)
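Bilinear filtering is one widely known example of a stencil-and-combine operation. The sketch below is a textbook version over a 2x2 stencil, not a description of any vendor's proprietary filter; it assumes texel coordinates within the texture and clamps at the edges.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a 2-D texture (list of rows) at fractional texel coords (u, v)."""
    height, width = len(texture), len(texture[0])
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)  # clamp at edge
    fx, fy = u - x0, v - y0
    # Combine the 2x2 stencil: blend along x, then along y.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0.0, 1.0],
       [2.0, 3.0]]
middle = bilinear_sample(tex, 0.5, 0.5)
print(middle)
```

Doing this in hardware means the shader issues one fetch and gets back the combined value.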


Application space

• Little communication between processing elements

• Approximate spatial derivative by 2x2 difference operator

• Forces all machine designs to work on multiples of four pixels
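A 2x2 difference operator can be sketched as below. The quad layout and the function name are assumptions for illustration; the point is that each pixel's derivative needs its neighbor's value, so the four pixels have to be processed together.

```python
def quad_derivatives(quad):
    """Approximate d/dx and d/dy over a 2x2 quad.

    quad is [[top_left, top_right], [bottom_left, bottom_right]].
    """
    (tl, tr), (bl, br) = quad
    ddx = (tr - tl, br - bl)  # horizontal difference, one per row
    ddy = (bl - tl, br - tr)  # vertical difference, one per column
    return ddx, ddy

# Hypothetical value evaluated at four neighboring pixels.
values = [[1.0, 3.0],
          [2.0, 4.0]]
ddx, ddy = quad_derivatives(values)
print(ddx, ddy)
```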


Application space

• Throughput is important

• Use threading to cover latency

• The chips can support hundreds of threads, and can switch from thread to thread every cycle

– No thread switch overhead

– Hardware scheduler and thread system

– Compiler knows about threads and splits resources over threads

• Caches are very different: they can only cover spatial locality
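A toy model shows why many threads with free switching cover latency. It assumes each thread issues one instruction and then waits a fixed number of cycles on memory, and that the scheduler picks any ready thread at zero cost; the numbers are illustrative, not measurements of real hardware.

```python
def alu_utilization(num_threads, latency, cycles=10_000):
    """Fraction of cycles in which some thread issues an instruction."""
    ready_at = [0] * num_threads   # cycle at which each thread can issue again
    busy = 0
    for cycle in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # Issue one instruction, then stall for `latency` cycles.
                ready_at[t] = cycle + 1 + latency
                busy += 1
                break              # at most one issue per cycle
    return busy / cycles

for n in (1, 5, 10):
    print(n, "threads:", alu_utilization(n, latency=9))
```

In this model utilization grows linearly with the thread count until there are enough threads to fill every stall cycle.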


Programming model

• Performance is much less than users want

– A minimum of 100,000 times less

• Most developers write each program at least four times

– Xbox

– PlayStation

– ATI top machine

– NVIDIA top machine

• Programs are in two parts: vertex and pixel shaders.


Programming model 2

• Programs can be written in a C-like high-level language (HLSL/OGL2)

• Or in a virtual assembly language (DirectX, …)

– Almost one dialect per chip

– The languages are virtual, but the resource limits are physical.

• Developers review virtual machine listings for performance

• Developers ship virtual assembly language.


Programming model 3

• At game startup, virtual assembly language is JIT compiled to real machine language

– Drastic change in resource requirements

– Somewhat hard to debug

– Hard to identify performance bottlenecks

• Even though applications could build code on the fly, developers pretest everything. They want the most performance to get the best looking image, and can only approximate what they really want.


Programmable Pipeline

[Figure: pipeline diagram. Vertex data (model space) enters the geometry stage, which is either fixed-function transform and lighting or a vertex shader, followed by clipping and viewport mapping. The rasterizer stage is either texture stages or a pixel shader, followed by fog, alpha, stencil, and depth testing.]


Vertex Processing Flow

[Figure: a triangle mesh supplies per-vertex data (position, normal, texture coordinates, etc.) to the vertex shader engine. The engine also uses constants (view matrix, projection matrix, skin/bone matrices, light positions, etc.), vertex shader instructions, and temporary registers. Its outputs are position, "texture" coordinates, and color(s).]


Vertex Shader

• Input:

– Program specifies vertex data: position, normal, vertex color, texture coordinate(s), …

– Data is sent to the graphics card and processed by the vertex shader

• Output:

– Vertex shader computes output quantities: position, vertex color (diffuse and specular), texture coordinate(s)

– Sent to rasterizer via interpolators
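The input/output contract above can be sketched in ordinary code. This is a Python model of what a simple vertex shader computes, not shader-language syntax; the constant names (`world_view_proj`, `light_dir`, `light_color`) are hypothetical.

```python
def mat4_mul_vec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-element vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def vertex_shader(position, normal, constants):
    """Transform one vertex and compute a diffuse color for it."""
    clip_pos = mat4_mul_vec4(constants["world_view_proj"], position)
    # Lambert term: clamp the dot product of normal and light direction.
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, constants["light_dir"])))
    diffuse = tuple(n_dot_l * c for c in constants["light_color"])
    return {"position": clip_pos, "diffuse": diffuse}

identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
constants = {"world_view_proj": identity,
             "light_dir": (0.0, 0.0, 1.0),
             "light_color": (1.0, 0.5, 0.25)}

out = vertex_shader((1.0, 2.0, 3.0, 1.0), (0.0, 0.0, 1.0), constants)
print(out)
```

The same function runs once per vertex, with only the per-vertex inputs changing; the constants stay fixed for the whole mesh.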


Pixel Processing Flow

[Figure: interpolated values ("texture" coordinates, color(s)) and textures feed the pixel shader engine. The engine also uses constants (light colors, ambient lighting colors, etc.), pixel shader instructions, and temporary registers. Its output is a color, optionally to multiple render targets.]


Program sizes

• Most programs are very small

• 100 virtual instructions would be a large program

• Basic data type is a four element vector of floats

• Integer data types are not yet available

• Dynamic branching is new

• Small amount of nesting allowed


Polygons

• Polygon Budget

– Ruby: 75,000

– Optico: 50,000

– Ninja: 25,000

– Environment: 100,000

– Props: 50,000

• Lighting Limits

– 3 dynamic lights per shot (1 shadow casting)

– Lightmaps used for set

• Animation Limits

– 35 total blend shapes

– 5 simultaneous blend shapes

– 4 weighted bones per vertex

– Number of on-screen characters limited to 4 at once


Shader Breakdown

• Depth of Field

• Hair

• Skin


Depth Of Field

Shader Breakdown

• Glows

• Motion Blur

• Reflections


Glows


Motion Blur


Reflections


Hardware view

• X1900

• Xbox 360

• Both machines are current


X1900

[Figure: X1900 block diagram. Labeled blocks: Vertex Data, Vertex Shader Engine, Backface Cull, Perspective Divide, Clip, Viewport Transform, Setup Engine, Geometry Assembly, Rasterization, Interpolators, Hierarchical Z Test, Ultra-Threading Dispatch Processor, Pixel Shader Engine, General Purpose Register Arrays, Texture Units, Texture Cache, Z/Stencil Buffer Cache, Compress, Decompress.]


Quad Pixel Shader Core

[Figure: X1900 block diagram with the pixel shader engine expanded. Each pixel shader processor contains Vector ALU 1, Vector ALU 2, Scalar ALU 1, Scalar ALU 2, and a Branch Execution Unit. Texture Address Units 1 to 4 sit alongside the pixel shader processors.]

Pixel Shader Processor, per clock cycle:

• 1 vec3 ADD + input modifier

• 1 scalar ADD + input modifier

• 1 vec3 ADD/MUL/MADD

• 1 scalar ADD/MUL/MADD

• 1 flow control instruction

Texture Address Units

• 1 texture address instruction per unit per clock cycle


Vertex Engine

• Upgraded to support SM3.0

– Dynamic flow control

– 1,024 instructions (practically unlimited with flow control)

– More temporary registers

• 8 Vertex Shader Processors

– Each can handle 2 shader instructions per clock

– 10 billion instructions per second

[Figure: vertex data feeds a 128-bit vector ALU, a 32-bit scalar ALU, and flow control, followed by backface cull, perspective divide, clip, and viewport transform, then on to the setup engine.]
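The throughput figure on this slide implies a clock rate. The sketch below just divides the quoted numbers; the slide does not state a clock, and the 10 billion figure is presumably rounded, so treat the result as approximate.

```python
def implied_clock_hz(instr_per_second, processors, instr_per_clock):
    """Clock rate implied by a peak instruction throughput."""
    return instr_per_second / (processors * instr_per_clock)

# 10 billion instructions/s, 8 vertex shader processors, 2 instructions/clock.
clock = implied_clock_hz(10e9, 8, 2)
print(clock / 1e6, "MHz")
```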


Ring Bus Memory Controller

• Supports today’s fastest graphics memory devices

– GDDR3, 48+ GB/sec

– GDDR4, the future

• 512-bit Ring Bus

– Simplifies layout and enables extreme memory clock scaling

• New Cache Design

– Fully associative for more optimal performance

• Improved Hyper Z

– Better compression and hidden surface removal

• Programmable Arbitration Logic

– Maximizes memory efficiency

– Can be upgraded via software


Memory Channels - 4x Improvement in Random Access over X850

[Figure: memory controller and memory devices for each chip, joined by a 256 bit interface.]

                    Radeon X850      Radeon X1900

Channels            4 x 64-bit       8 x 32-bit

Banks per DRAM      4                8
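Both channel layouts add up to the same 256 bit interface; only the granularity changes. The sketch below checks that and derives the per-pin data rate implied by the 48 GB/sec GDDR3 figure quoted for the X1900, assuming 1 GB = 10^9 bytes. The function name is illustrative.

```python
def per_pin_rate(bandwidth_bytes_per_s, interface_bits):
    """Bits per second each data pin must carry for a given bandwidth."""
    return bandwidth_bytes_per_s * 8 / interface_bits

x850_width = 4 * 64    # 4 x 64-bit channels
x1900_width = 8 * 32   # 8 x 32-bit channels
rate = per_pin_rate(48e9, x1900_width)
print(x850_width, x1900_width, rate / 1e9, "Gbit/s per pin")
```

Narrower channels mean each independent access occupies a smaller slice of the interface.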


Cache Design

[Figure: how graphics memory maps onto cache lines, comparing a direct-mapped cache with a fully associative cache.]

• Fully Associative Caches

– Cache lines can map to any location in external memory

– Earlier designs used direct-mapped and N-way associative caches

– Could only access limited blocks of external memory

• Texture, color, Z and stencil caches are all now fully associative

– Reduces memory bandwidth requirements

– Minimizes cache contention stalls

– Optimized game performance

– Gains up to 25% clock for clock in fill/bandwidth bound cases
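The contention difference can be shown with two toy cache models. Addresses here are block numbers, and the fully associative model uses LRU replacement; the replacement policy is an assumption, since the slides do not say which one the hardware uses.

```python
from collections import OrderedDict

def direct_mapped_hits(addresses, num_lines):
    """Each block can live in exactly one line: address mod num_lines."""
    lines = [None] * num_lines
    hits = 0
    for a in addresses:
        idx = a % num_lines
        if lines[idx] == a:
            hits += 1
        else:
            lines[idx] = a          # evict whatever was in that line
    return hits

def fully_associative_hits(addresses, num_lines):
    """Any block can live in any line; evict the least recently used."""
    cache = OrderedDict()
    hits = 0
    for a in addresses:
        if a in cache:
            hits += 1
            cache.move_to_end(a)    # mark as most recently used
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)
            cache[a] = True
    return hits

# Blocks 0 and 8 collide in an 8-line direct-mapped cache.
trace = [0, 8, 0, 8, 0, 8]
dm = direct_mapped_hits(trace, 8)
fa = fully_associative_hits(trace, 8)
print("direct mapped:", dm, "fully associative:", fa)
```

The two blocks keep evicting each other in the direct-mapped cache, while the fully associative cache holds both.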


Xbox

• 3.2GHz Custom IBM Central Processor

• Three CPU Cores

• Two Threads Per Core

• VMX Unit Per Core

• 128 VMX Registers Per Thread

• 1MB L2 Cache (Lockable by Graphics Processor)

• 500MHz Custom ATI Graphics Processor

• Unified Shader Core

• 48 ALUs for Vertex or Pixel Shader processing

• 16 Filtered & 16 Unfiltered Texture samples per clock

• 10MB eDRAM Framebuffer

• 512MB System RAM

• Unified Memory Architecture (UMA)

• 128-bit interface

• 700MHz GDDR3 RAM
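The system memory figures above imply a peak bandwidth. A minimal sketch, assuming GDDR3 transfers data on both clock edges (two transfers per clock) and 1 GB = 10^9 bytes; the function name is made up for illustration.

```python
def peak_bandwidth_bytes(interface_bits, clock_hz, transfers_per_clock=2):
    """Peak bytes per second for a memory interface."""
    return interface_bits // 8 * clock_hz * transfers_per_clock

# 128-bit interface, 700 MHz GDDR3.
bw = peak_bandwidth_bytes(128, 700_000_000)
print(bw / 1e9, "GB/sec")
```

Because the memory is unified, the CPU and the graphics processor share this bandwidth.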


Architecture

[Figure: block diagram. Labeled blocks: Command Processor, Memory Hub, Vertex Grouper, Primitive Assembly, Scan Converter, two Shader Interp blocks, Sequencer, three Shader Pipes (x16 each), Vertex Cache, Texture Pipe, Texture Cache, Pipe Comm, Z/Alpha/Stencil Processors, and 10MB DRAM, with one link labeled 256 GB/sec.]


Adaptive Shader Array

• Unified shader architecture

• One processor type

• Dynamic load balancing

• Pixel and vertex processing where and when they’re needed

– 48 shaders

• 120 billion operations per second
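The operations figure can be related to the shader count. The sketch below assumes the 500MHz graphics clock quoted on the Xbox slide and simply divides; what counts as one "operation" is not defined in the slides.

```python
def ops_per_shader_per_clock(ops_per_second, shaders, clock_hz):
    """Operations each shader must complete per clock to reach the total."""
    return ops_per_second / (shaders * clock_hz)

# 120 billion operations/s across 48 shaders at an assumed 500 MHz.
per_clock = ops_per_shader_per_clock(120e9, 48, 500e6)
print(per_clock, "operations per shader per clock")
```

Five operations per clock would be consistent with a four-component vector operation plus one scalar operation, but that reading is an inference, not something the slide states.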



Some interesting problems

• Coherence (branch prediction?)

• What are the right instructions?

• Can you do non-graphics applications?

• Programming language

• Threading by compiler

• Offline compile?


Implications for programming languages

• GPU – can convince people to use a new language if you can prove it is faster, even if it means lots of changes

• Desktop CPU – have to prove it can meet some other (non-performance/function) need

• Top-of-the-line GPU prices are going up while top-of-the-line desktop CPU prices are going down, so there is a lot of chance to do cool design.

• Less need to be backward compatible.


More info

• http://www.ati.com/developer/index.html
