Challenges for GPU Architecture
Michael Doggett, Graphics Architecture Group
April 2, 2008

Page 1: Challenges for GPU Architecture

Challenges for GPU Architecture

Michael Doggett, Graphics Architecture Group

April 2, 2008

Page 2: Challenges for GPU Architecture

Graphics Processing Unit

Architecture
– CPUs vs GPUs
– AMD’s ATI RADEON 2900

Programming
– Brook+, CAL, ShaderAnalyzer

Architecture Challenges
– Accelerated Computing

Page 3: Challenges for GPU Architecture

Architecture

ATI WhiteOut Demo [2007]

Page 4: Challenges for GPU Architecture

Chip Design Focus Point

GPU – throughput machines:
– Simple sync
– No OS
– Data parallel
– Little reuse
– Few instructions, lots of data: SIMD, hardware threading

CPU – latency machines:
– Complex sync
– Needs OS
– Task parallel
– Reuse and locality
– Lots of instructions, little data: out-of-order execution, branch prediction

Page 5: Challenges for GPU Architecture

Typical CPU Operation

Waits on memory: gaps prevent peak performance. Gap size varies dynamically, making latency hard to tolerate.

One iteration at a time on a single CPU unit – cannot reach 100% utilization.

Hard to prefetch data, with a limited number of outstanding fetches. Multi-core does not help; a cluster does not help.

Page 6: Challenges for GPU Architecture

GPU Threads (Lower Clock – Different Scale)

Fetch and ALU work are overlapped, with many outstanding fetches.

Lots of threads share a fetch unit + ALU unit, with fast thread switching and in-order finish.

ALU units reach 100% utilization; hardware sync handles the final output.
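The latency hiding described above can be sketched as a toy model (the cycle counts are illustrative, not RADEON timings): one thread cannot cover a long memory fetch, but with enough threads the fetch latency of any one thread is hidden behind ALU work from the others.

```python
# Toy model of GPU latency hiding: each thread alternates one long-latency
# memory fetch with a short burst of ALU work. With enough threads in
# flight, the ALU never sits idle waiting on memory.
def alu_utilization(num_threads, fetch_latency, alu_cycles=1):
    # Each thread contributes `alu_cycles` of ALU work per fetch; the ALU
    # is fully busy once that pooled work covers one thread's fetch latency.
    work_available = num_threads * alu_cycles
    return min(1.0, work_available / (fetch_latency + alu_cycles))

# One thread cannot hide a 100-cycle fetch...
print(alu_utilization(1, 100))    # ~0.0099
# ...but enough threads keep the ALU at 100%.
print(alu_utilization(128, 100))  # 1.0
```

This is why the slide contrasts "hardware threading" with the CPU's out-of-order execution: both hide latency, but the GPU does it by switching among many cheap threads rather than reordering one expensive instruction stream.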

Page 7: Challenges for GPU Architecture

RADEON 2900 Top Level

[Block diagram; red = compute/fixed function, yellow = cache. Units shown: Command Processor, Vertex Index Fetch, Vertex Assembler, Geometry Assembler, Tessellator, Setup Unit, Rasterizer, Interpolators, Hierarchical Z, Ultra-Threaded Dispatch Processor, Unified Shader Processors, Shader Instruction & Constant Caches, Texture Units, L1/L2 Texture Caches, Shader Export, Stream Out, Render Back-Ends, Z/Stencil Cache, Color Cache, Memory Read/Write Cache.]

Highlights: unified shader, shader read/write cache, instruction/constant cache, unified texture cache, compression, tessellator.

Page 8: Challenges for GPU Architecture

Ultra-Threaded Dispatch Processor/Scheduler

Main control for the shader core
– All workloads have threads of 64 elements
– 100s of threads in flight
– Threads are put to sleep when they request a slow-responding resource

Arbitration policy
– Age/Need/Availability
– When in doubt, favor pixels
– Programmable

[Diagram: the dispatch processor highlighted within the top-level block diagram.]
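The arbitration policy described above (availability, then age and need, favoring pixels when in doubt) can be sketched as a toy arbiter. The field names, tie-break order, and scoring are illustrative assumptions, not AMD's actual scheduler, which the slide notes is programmable.

```python
# Illustrative thread arbiter: among runnable threads, prefer the oldest,
# then the neediest, and break remaining ties in favor of pixel threads.
def pick_thread(threads, cycle):
    runnable = [t for t in threads if not t["asleep"]]   # availability
    if not runnable:
        return None
    def score(t):
        age = cycle - t["issued_at"]                      # older is better
        pixel_bias = 1 if t["type"] == "pixel" else 0     # favor pixels
        return (age, t["need"], pixel_bias)
    return max(runnable, key=score)

threads = [
    {"type": "vertex", "issued_at": 10, "need": 2, "asleep": False},
    {"type": "pixel",  "issued_at": 10, "need": 2, "asleep": False},
    {"type": "pixel",  "issued_at": 5,  "need": 1, "asleep": True},  # waiting on a texture fetch
]
print(pick_thread(threads, cycle=20)["type"])  # pixel (tie on age/need; pixel bias wins)
```

Note how the sleeping thread is simply excluded: putting threads to sleep on slow resources is what lets hundreds stay in flight without blocking the ALUs.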

Page 9: Challenges for GPU Architecture

Ultra-Threaded Dispatch Processor

[Diagram: per-shader-type command queues feed the dispatch processor – a Vertex Shader command queue (VS threads 1–3), a Geometry Shader command queue (GS threads 1–2), and a Pixel Shader command queue (PS threads 1–4). Eight arbiter/sequencer pairs, plus dedicated texture-fetch and vertex-fetch arbiter/sequencer pairs, dispatch work to four SIMD arrays of 80 stream processing units each. Supporting blocks: vertex/texture units, a shared vertex/texture cache, the Setup Engine (Interpolators, Geometry Assembler, Vertex Assembler), shader instruction cache, shader constant cache, and instructions-and-control paths.]

Page 10: Challenges for GPU Architecture

Shader Core

4 parallel SIMD units; each unit receives an independent ALU instruction.

Very Long Instruction Word (VLIW)

ALU instruction (1 to 7 64-bit words)
– 5 scalar ops, 64 bits each for src/dst/controls/op
– 2 additional words for literal constant(s)
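The 1-to-7-word range above falls out of the encoding directly. A minimal sketch, assuming one 64-bit word per scalar op plus an all-or-nothing pair of literal words (the function name is ours, not AMD's):

```python
# Size of one VLIW ALU instruction in 64-bit words, per the encoding on
# this slide: one word per scalar op (1..5), plus 2 words if the
# instruction carries literal constants.
def alu_instr_words(num_scalar_ops, has_literals):
    assert 1 <= num_scalar_ops <= 5, "bundle holds at most 5 scalar ops"
    return num_scalar_ops + (2 if has_literals else 0)

print(alu_instr_words(1, False))  # 1  (smallest instruction)
print(alu_instr_words(5, True))   # 7  (largest: 5 ops + 2 literal words)
```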

Page 11: Challenges for GPU Architecture

Stream Processing Units

5 Scalar Units
• Each scalar unit does FP multiply-add (MAD) and integer operations
• One also handles transcendental instructions
• IEEE 32-bit floating point precision

Branch Execution Unit
• Flow control instructions

Up to 6 operations co-issued

General Purpose Registers
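The co-issue described above – five scalar MADs (plus a branch op) retiring together – can be illustrated with a toy bundle evaluator. This is a conceptual sketch of what one issue slot computes, not the hardware's issue logic:

```python
# Toy model of one VLIW issue slot: up to 5 scalar multiply-add (MAD)
# lanes execute together, each computing a*b + c on 32-bit floats.
def issue_bundle(lanes):
    assert len(lanes) <= 5, "at most 5 scalar MAD ops per bundle"
    return [a * b + c for (a, b, c) in lanes]

# Five independent MADs complete in a single "cycle" of the model.
print(issue_bundle([(1.0, 2.0, 3.0)] * 5))  # [5.0, 5.0, 5.0, 5.0, 5.0]
```

The compiler's job (and a key theme of the Rubin reference at the end) is packing independent scalar ops into these bundles; a bundle with fewer than five useful ops wastes lanes.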

Page 12: Challenges for GPU Architecture


Programming

Page 13: Challenges for GPU Architecture

Programming model

Vertex and pixel kernels (shaders)

Parallel loops are implicit

Performance-aware code does not know how many cores or threads there are

All sorts of queues are maintained under the covers

All kinds of sync are done implicitly

Programs are very small

Page 14: Challenges for GPU Architecture

Parallelism Model

All parallel operations are hidden behind domain-specific API calls

Developers write sequential code + kernels

Each kernel operates on one vertex or pixel

Developers never deal with parallelism directly

No need for auto-parallelizing compilers
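The model above – sequential host code plus a kernel written for a single element, with the parallel loop left implicit – can be sketched in plain Python. This is a conceptual analogue, not Brook+ or CAL syntax, and the names are ours:

```python
# The developer writes a kernel for ONE pixel; the runtime applies it
# across the whole domain. The parallel loop is implicit.
def brighten_kernel(pixel):             # operates on a single (r, g, b) pixel
    r, g, b = pixel
    return (min(r + 16, 255), min(g + 16, 255), min(b + 16, 255))

def launch(kernel, domain):             # stand-in for the API's implicit map
    return [kernel(x) for x in domain]  # hardware runs these in parallel

image = [(0, 0, 0), (100, 200, 250)]
print(launch(brighten_kernel, image))   # [(16, 16, 16), (116, 216, 255)]
```

Because the kernel never sees its neighbors or the loop, the runtime is free to split the domain across any number of SIMD arrays and threads – which is exactly why performance-aware code need not know the core or thread count.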

Page 15: Challenges for GPU Architecture

Ruby Demo Series

Four versions, each done by experts to show off features of the chip as well as develop novel forward-looking graphics techniques

First 3 written in DirectX 9, fourth in DirectX 10

Page 16: Challenges for GPU Architecture

DX Pixel Shader Length

[Chart: pixel shader size by demo version (1–4), x-axis 0–800.]

Number of pixel shaders: demo 1 = 140, demo 2 = 163, demo 3 = 312, demo 4 = 250

[Chart: triangles (in 1000s) per demo, d1–d4, y-axis 500–2000.]

Page 17: Challenges for GPU Architecture


AMD Stream Computing

Page 18: Challenges for GPU Architecture


GPU ShaderAnalyzer

Page 19: Challenges for GPU Architecture


Architecture Challenges

Page 20: Challenges for GPU Architecture

GPU Directions

Continued improvement and evolution in graphics
– RADEON 2900 Tessellator

Continued conversion/removal of fixed function
– Which fixed function?

Increasing importance of the GPU as an accelerator
– Generalization of GPU architecture
– More aggressive GPU thread scheduler
– Shaders currently read only via texture, write only via color exports

Need new GPU shader architectures to meet much wider requirements
– Already a range of requirements across Graphics and Compute

Page 21: Challenges for GPU Architecture


Accelerated Computing

Page 22: Challenges for GPU Architecture

The Accelerated Computing Imperative

Homogeneous multi-core becomes increasingly inadequate

Targeted special-purpose HW solutions can offer substantially better power efficiency and throughput than traditional microprocessors when operating on some of these new data types

Power constraints will force applications to be performance heterogeneous
– Applications can target the HW device to get this power benefit

GPUs – high power efficiency, more than 2 GFLOPs/watt
– More than 20× a dual-core CPU
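The efficiency claim above implies a rough figure for the CPU side. A quick check of the arithmetic (the dual-core number is inferred from the slide's 20× ratio, not stated directly):

```python
# Power-efficiency comparison implied by the slide's figures.
gpu_gflops_per_watt = 2.0                 # stated: more than 2 GFLOPs/watt
ratio = 20                                # stated: more than 20x a dual-core CPU
cpu_gflops_per_watt = gpu_gflops_per_watt / ratio
print(cpu_gflops_per_watt)                # 0.1 GFLOPs/watt implied for the dual-core CPU
```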

Page 23: Challenges for GPU Architecture

AMD's Accelerated Computing Initiative

Discrete CPUs + GPUs

Full integration – Fusion – multi-core CPU + GPU accelerator

Compatibility will continue to be a key enabler in our industry
– Need SW for new HW

How should existing compute APIs evolve? Do we need new API models?
– GPU parallelism is so successful because graphics APIs are sequential
– New APIs can’t tie down GPU progress
– Need to replicate the success of DX; need IHV input

Can only use GPGPU APIs when performance is necessary and the programmer understands the machine
– Layers of computation
– Compilers that can target workloads at appropriate processors

Page 24: Challenges for GPU Architecture


Future GPU/Accelerated Computing Applications

Which applications benefit from combined CPU+GPU, and how?

What type of architectural workload coupling between CPU and GPU do these applications require?

What are the new data and communication requirements?

What are the costs to existing performance to broaden what we do well?

Page 25: Challenges for GPU Architecture

Form Factors for Accelerated Computing

UMPC, HDTV, handset, home cinema, notebook, desktop, digital STB/PVR, game console, home media server, OCUR

The embedded space has already embraced heterogeneous computing models

Page 26: Challenges for GPU Architecture

Accelerated Computing

GPGPU APIs enable offloading compute to GPUs
– Without heroic programming efforts
– API compatibility enables hardware innovation without new SW: existing apps gain performance on new HW

Broad range of possible accelerator designs
– We need both CPUs and GPUs
– Amdahl's law

Page 27: Challenges for GPU Architecture

Lots of Challenges …

• Managing context state and exceptions
– This includes the program-visible state in the compute offload engine!
– Virtualizing the context state

• Communications/Messaging
– Simplified & fabric-independent producer-consumer model
– Optimized communications is a key enabler
– It’s the synchronization, stupid

• Memory BW and Data Movement
– Keeping up with the computation rates will require increasingly capable memory systems

• New and appropriate APIs
– Must offer a programming model that is actually easier than today’s multi-core models
– Use abstraction to trade some performance for programmer productivity
– Live within bounds set by OS and concurrent runtimes

Page 28: Challenges for GPU Architecture

Processor World Map

[Diagram: processor landscape with CPU, GPU, HC, and MPP regions.]

Page 29: Challenges for GPU Architecture


Questions?

Internships

ATI Fellowships

Michael.doggett ‘at’ amd.com

Page 30: Challenges for GPU Architecture

References

“Issues and Challenges in Compiling for Graphics Processors”, Norm Rubin, Code Generation and Optimization (CGO), 8 April 2008

“The Role of Accelerated Computing in the Multi-Core Era”, Chuck Moore, March 2008