TRANSCRIPT
University of Virginia © Kevin Skadron, 2008
Massively Parallel Graphics Processors in a Multicore, Power-Limited Era

Kevin Skadron
University of Virginia Dept. of Computer Science, LAVA Lab
and NVIDIA Research
Outline of Overall Talk
Why multicore? How did we get into this jam?
What next? How do we get out of this jam?
Are heterogeneous architectures the answer?
What is the role of graphics processors (GPUs)?
Role in system architecture
Architecture and programming overview (2nd half of talk)
Disclaimer
The opinions here are my own as a computer engineer. They represent my interpretation of technology trends and associated opportunities. They do not in any way represent positions or plans of the University of Virginia or NVIDIA.
Why Multicore? How did we get here?
Combination of both "ILP wall" and "power wall"
ILP wall: wider superscalar & more aggressive out-of-order execution → diminishing returns
Boost single-thread performance → boost frequency
Power wall: boosting frequency to keep up with Moore's Law (2X per generation) is expensive
Natural frequency growth with technology scaling is only ~20-30% per generation
– Don't need expensive microarchitectures just for this
Faster frequency growth requires:
– Aggressive circuits (expensive)
– Very deep pipeline – 30+ stages? (expensive)
– Power-saving techniques weren't able to compensate
No longer worth the Si, cooling costs, or battery life
Single-core Watts/Spec
[Scatter plot: Watts/Spec (log scale, 0.001-1) vs. Spec2000 performance (log scale, 1-10000) for single-core processors through 2005, including Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64. Normalized to the same technology node. Courtesy Mark Horowitz.]
The Multi-core Revolution
Can't make a single core much faster
But need to maintain profit margins
More and more cache → diminishing returns
"New" Moore's Law: same core is 2X smaller per generation, can double # cores
Focus on throughput
Can use smaller, lower-power cores (even in-order issue)
Make cores multi-threaded
Trade single-thread performance for:
Better throughput
Lower power density
Maybe keep one aggressive core so we don't make single-thread performance worse
Can Parallelism Succeed?
Parallel computing never took off as a commodity
Expensive and rare despite decades of investments
What's different this time?
Need – power wall!
Opportunity – commodity hardware is out there now in hundreds of millions of PCs and servers
x86 multicores – 6-way multicore has been announced
GPUs – 100+ cores!
This breaks the chicken-and-egg problem
– Parallel language and software creators no longer need to wait for parallel hardware
What To Do With All Those Cores?
PC workloads have limited number of independent tasks
Parallel programming is hard, isn’t it?
Well, at least we solved the power wall…or did we?
Moore’s Law and Dennard Scaling
Moore's Law: transistor density doubles every N years (currently N ~ 2)
Dennard Scaling (constant electric field):
Shrink feature size by k (typ. 0.7), hold electric field constant
Area scales by k² (≈1/2); C, V, and delay each reduce by k
P ∝ CV²f → P goes down by k²
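Spelling out the constant-field arithmetic (a worked derivation added here for completeness; delay shrinking by k means frequency can rise by 1/k):

\[ C' = kC, \qquad V' = kV, \qquad f' = f/k \]
\[ P' = C'V'^2 f' = (kC)(kV)^2 (f/k) = k^2\, CV^2 f \]
\[ \frac{P'}{A'} = \frac{k^2 P}{k^2 A} = \frac{P}{A} \quad \text{(power density stays constant)} \]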
Moore’s Law and Dennard Scaling
Works well for “shrinks”
Doesn't apply to the high end
Generally keep area constant, use doubled transistor density to add more features, so total C doesn't scale
Exec. time = Insts × CPI × cycle time
Out-of-order execution, wide superscalar, aggressive speculation to boost instruction-level parallelism (improve CPI)
Aggressive pipelining, circuits, etc. to boost frequency (cycle time) beyond the "natural" rate
Leakage has been going up
Power and power density went up, not down
Actual Power
[Chart: Max Power (Watts), log scale 1-100, for Intel processors from i386 and i486 through Pentium, Pentium w/MMX tech., Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core 2 Duo. Source: Intel.]
Power Wall Redux
Vdd scaling is coming to a halt
Currently 0.9-1.0V, scaling only ~2.5%/gen [ITRS'06]
Even if we generously assume C scales and frequency is flat:
P ∝ CV²f → 0.7 × (0.975²) × 1 ≈ 0.66
Power density goes up: P/A = 0.66/0.5 = 1.33
And this is very optimistic, because C probably scales more like 0.8 or 0.9, and we want frequency to go up, so a more likely number is 1.5-1.75X
If we keep the %-area dedicated to all the cores the same, total power goes up by the same factor
But max TDP for air cooling is expected to stay flat
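The 1.5-1.75X figures follow from the same formula under the less generous assumptions the slide names (my reconstruction of the arithmetic, not from the source):

\[ \frac{P'/A'}{P/A} = \frac{C_{\mathrm{scale}} \cdot V_{\mathrm{scale}}^2 \cdot f_{\mathrm{scale}}}{A_{\mathrm{scale}}} = \frac{0.8 \times 0.975^2 \times 1}{0.5} \approx 1.52, \qquad \frac{0.9 \times 0.975^2 \times 1}{0.5} \approx 1.71 \]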
Thermal Considerations
When cooling is the main constraint:
Pick max Tj, typically 100-125C, based on reliability, leakage tolerance, and ergonomics
The most thermally efficient design maximizes TDP (and hopefully throughput) under this constraint
Hotspots hit Tj faster → lost opportunity
Seek thermally uniform macro-architectures
Multicore layout and "spatial filtering" give you an extra lever
The smaller a power dissipator, the more effectively it spreads its heat [IEEE Trans. Computers, to appear]
Ex: 2x2 grid vs. 21x21 grid: 188W TDP vs. 220 W (17%) – DAC 2008
• Increase core density
• Or raise Vdd, Vth, etc.
Thinner dies, better packaging boost this effect
Seek architectures that minimize area of high power density, maximize area in between, and can be easily partitioned
Where We are Today - Multicore
Classic architectures
Power wall
Programmability wall
http://interactive.usc.edu/classes/ctin542-designprod/archives/r2d2-01.jpg
Outline of Overall Talk
Why multicore? How did we get into this jam?
What next? How do we get out of this jam?
Are heterogeneous architectures the answer?
What is the role of graphics processors (GPUs)?
Role in system architecture
Architecture and programming overview
Low-Fat Cores
Claes Oldenburg, Apple Core – Autumn. http://www.greenwicharts.org/pastshows.asp
What Do We Do?
Make conventional cores more efficient
Lower-power flip-flops, lower-power clock tree
Simpler pipeline, simpler cache
But – this is running out of steam!
Use parallelism to boost efficiency
Simpler general-purpose cores, multi-threading
Asymmetric architectures
But – as # cores ↑, overhead ↑
Specialized storage/communication (e.g., scratchpad, streams)
But – this complicates programming
For all the above, the communication and memory hierarchy are designed for the lowest common denominator
Rethink the architecture:
SIMD
Specialized coprocessors (GPUs, media, crypto, FPGAs…)
Asymmetry
Asymmetric: cores with same ISA, different sizes/microarchitectures
Heterogeneous: different ISAs (e.g., Fusion, Cell BE)
1-2 aggressive ILP cores, then scale up # simple cores
Supports good single-thread performance and good scalable throughput
Dynamic cores: composing ILP cores from several simple cores avoids problems of fixed partitioning
What if 3 threads need high perf? Or 0?
But more design complexity
Federation (DAC 2008) – combines two simple, in-order cores to get one out-of-order core
General-Purpose Focus on Throughput
Simplifying cores saves area and power
Allows more processing elements (PEs) in a given area
Multithreading maximizes utilization of the PEs
Tolerates pipeline and memory latencies
Permits further simplification of cores
Can leave lots of state idle (full register context) while waiting on memory
Software-controlled data motion (via scratchpads or streams) is an alternative way to manage memory latency
Avoid unexpected cache misses
Stage data to overlap gather of the next "chunk" with computation using the current "chunk"
But – irregular data-access patterns and fine-grained read-write sharing are very hard to manage in software
Main drawback: general-purpose cores and memory hierarchy need to work for all programs → lowest common denominator
Specialize (1)
SIMD
+ Amortizes fetch, decode, control, and register-access logic
+ Tends to better preserve memory-access locality
+ Space savings allow more ALUs or on-chip memory in same area/total power
- Tends to have nasty crossbars
- Doesn't deal well with threads that can't stay in lockstep
  • Multiple cores of limited SIMD width
  • Work queues, conditional streams, etc. needed for reconvergence within a SIMD word
- How to support single-thread performance?
  • Processor for a single "thread" is typically pretty wimpy
- Densely packed ALUs
  • Can they be spread out?
Specialize (2)
Is heterogeneity the answer?
Specialized coprocessors trade generality for efficiency
Datapaths, memory hierarchies tuned for certain types of code
Graphics processors (GPUs)
Network processors (NPUs)
Media processors
10-100x speedups often possible
May still be high-power cores
But also high performance/watt
Main drawbacks:
Cooling can still be a challenge
Only suitable for certain types of algorithms
How do you choose which coprocessors to include?
Each has its own API; programming a collection of different coprocessors is a potential nightmare
Programming for Heterogeneity
Need either:
Very wide applicability (GPUs, media processors)
An architecture-specific API can still survive with enough market share
Flexible programming model that applies to multiple types of coprocessors
Flexible specification of parallelism
Ability to use hardware-accelerated functions when available
– Transcendentals, string matching, etc.
We are going to need new, higher-level programming models anyway
High-Level Programming Models
Claim: programmers who can do low-level parallel programming are an elite minority
We will never train the "average programmer" to write highly parallel programs in C, Java, X10, etc.
Most people need to think about things in pieces
And pieces need sequential semantics
But it's ok if the "pieces" are internally parallel
Threaded programming models don't easily support such decomposition
Must develop APIs and libraries with higher-level abstractions
Simplify the average programmer's task
Allow advanced programmers to drill down
We will need this regardless of what the underlying architecture is
But it also buys us more flexibility in the architecture
Hiding hardware details facilitates heterogeneity
Best APIs may be domain-specific
DirectX, OpenGL are a good case study
3D Rendering APIs
[Pipeline diagram: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display]
High-level abstractions for rendering geometry
Courtesy of D. Luebke, NVIDIA
3D Rendering APIs
High-level abstractions for rendering geometry
Serial ordering among primitives
Implicit synchronization
No guarantees about ordering within primitives → means no fine-grained synchronization
Middleware translates to CPUs and various GPUs (NVIDIA, ATI, Intel)
Domain-specific API is convenient for programmers and provides lots of semantic information to middleware: parallelism, load balancing, etc.
Domain-specific API is convenient for hardware designers: same API supports radically different architectures across product generations and companies
Similar arguments apply to Matlab, SQL, Map-Reduce, etc.
These examples show how abstractions might solve the programming challenges associated with both:
Highly parallel architectures
Heterogeneous architectures
Where Do GPUs Fit In?
Need scalable, programmable multicore
Scalable: doubling PEs ~doubles performance
GPUs have been doing this for years
Programmable: easy to realize perf. potential
GPUs provide a pool of cores with general-purpose instruction sets (plus graphics-specific extras)
DirectX, OpenGL allow apps to scale with the HW
CUDA leverages this background
[Chart: FLOPS, NVIDIA GPU vs. Intel CPU. © NVIDIA, 2007]
Why is CUDA Important? (1)
Mass-market host platform
Easy to buy and set up a system
Provides a solution for manycore parallelism
Not limited to small core counts
Easy-to-learn abstractions for massive parallelism
Abstractions not tied to a specific platform
Doesn't depend on graphics pipeline; can be implemented on other platforms
Preliminary results suggest that CUDA programs run efficiently on multicore CPUs [Stratton'08]
Supports a wide range of application characteristics
More general than streaming
Not limited to data parallelism
Why is CUDA Important? (2)
CUDA + GPUs facilitate multicore research at scale
NVIDIA Tesla: 16 8-way SIMD cores = 128 PEs, 12,288 thread contexts total
Simple programming model allows exploration of new algorithms and hardware bottlenecks
The whole community can learn from this
CUDA + GPUs provide a real platform…today
Results are not theoretical
Increases interest from potential users, e.g. computational scientists
Boosts opportunities for interdisciplinary collaboration
CUDA is teachable
Undergrads can start writing real programs within a couple of weeks
Terminology: What is "GPGPU"?
Definition 1: GPGPU = general-purpose computing with GPUs = any use of GPUs for non-rendering tasks
Definition 2: GPGPU = general-purpose computing with 3D APIs (i.e., DirectX and OpenGL)
3D APIs have processing overhead of the entire graphics pipeline
Limited interface to memory, no inter-thread communication
Often difficult to map application as rendering of polygon(s)
These restrictions are now indelibly tied to "GPGPU"
New wave of general-purpose computing avoids these restrictions → new term "GPU Computing"
Summary So Far
ILP wall + power wall → multicore
Power wall will limit multicore scaling too
Emphasizing throughput allows the individual cores to be simplified, reducing power
Thermal-aware design and placement can mitigate cooling limits
Eventually, all these techniques run out of steam
Heterogeneous architectures offer 10-100X performance, energy-efficiency benefits
Outline of Overall Talk
Why multicore? How did we get into this jam?
What next? How do we get out of this jam?
Are heterogeneous architectures the answer? Not clear yet…
What is the role of graphics processors (GPUs)?
Role in system architecture
Architecture and programming overview
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
Manycore GPU – Block Diagram
Tesla architecture, launched Nov. 2006
128 scalar PEs ("unified shaders")
Per-block shared memory (PBSM) allows communication among threads
[Block diagram: Host → Input Assembler → Thread Execution Manager dispatching work to clusters of Thread Processors, each cluster with its own PBSM; load/store path to Global Memory. © NVIDIA, 2007]
AMD/ATI Radeon HD 2900
320 PEs, but 16-way SIMD of 5-way VLIW
Still based on 4-vectors (x,y,z,w)
Source: Michael Doggett, AMD, “Radeon HD 2900”, keynote at Graphics Hardware
CUDA vs. GPUs
CUDA is a scalable parallel programming model and a software environment for parallel computing
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel programming model
Abstractions not GPU-specific
Also maps well to multicore CPUs! [Stratton’08]
AMD will use Brook+ – details not yet available, but presumably similar goals of scalability, portability, heavier focus on stream primitives
NVIDIA’s TESLA GPU architecture accelerates CUDA, DirectX, OpenGL
Tesla architecture is basis of GeForce, Quadro, and Tesla product lines
G80 = GeForce 8800 GTX
Heterogeneous Programming
CUDA = serial program with parallel kernels, all in C
Serial C code executes in a CPU thread
AMD CTM/CAL GPU interface is conceptually similar
Parallel kernel C code executes in thread blocks across multiple processing elements
Thread blocks are important for scalability
[Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU:
Serial Code → Parallel Kernel KernelA<<< nBlk, nTid >>>(args); → Serial Code → Parallel Kernel KernelB<<< nBlk, nTid >>>(args); → …]
Courtesy of M. Garland, NVIDIA
How do GPUs differ from CPUs?
Key: perf/mm²
Emphasize throughput, not per-thread latency
Maximize number of PEs and utilization
Many small PEs
Amortize hardware in time – multithreading
Hide latency with computation, not caching
Spend area on PEs instead
Hide latencies with fast thread switch and many threads/PE (24 on NVIDIA Tesla/G80!)
Exploit SIMD efficiency
Amortize hardware in space – share fetch/control among multiple PEs
8 in the case of Tesla
Note that SIMD ≠ vector
NVIDIA's architecture is "scalar SIMD" (SIMT), AMD does both
High bandwidth to global memory
Minimizes amount of multithreading needed
Tesla memory interface is 384-bit, AMD Radeon 2900 is 512-bit
Net result: 470 GFLOP/s and ~80 GB/s sustained in GeForce 8800GTX
How do GPUs differ from CPUs? (2)
Hardware thread creation and management
New thread for each vertex/pixel
CPU: kernel or user-level software involvement
Virtualized cores
Program is agnostic about physical number of cores
True for both 3D and general-purpose
CPU: number of threads generally f(# cores)
Hardware barriers
These characteristics simplify problem decomposition, scalability, and portability
Nothing prevents non-graphics hardware from adopting these features
How do GPUs differ from CPUs? (3)
Specialized graphics hardware exposed through CUDA
Texture path
High-bandwidth gather, interpolation
Constant memory
Even higher-bandwidth access to small read-only data regions
Transcendentals (reciprocal sqrt, trig, log2, etc.)
Different implementation of atomic memory operations
GPU: handled in memory interface
CPU: generally handled with CPU involvement
Local scratchpad in each core (a.k.a. per-block shared memory)
Memory system exploits spatial, not temporal locality
How do GPUs differ from CPUs? (4)
Fundamental trends are actually very general: exploit parallelism in time and space
Other processor families are following similar paths (multithreading, SIMD, etc.):
Radeon, Niagara, Larrabee, network/content processors, Clearspeed, Cell BE, many others…
Heterogeneous: Cell BE, Fusion, Tolapai
Myths of GPU Computing
Myth: GPUs layer normal programs on top of graphics
NO: CUDA compiles directly to the hardware
Myth: GPU architectures are very wide (1000s-way) SIMD machines
NO: NVIDIA Tesla is 32-wide
Myth: branching is impossible or prohibitive
NO: flexible branching and efficient management of SIMD divergence
Myth: GPUs compute with 4-wide vector registers
Still true for AMD Radeon, but NO: NVIDIA Tesla is scalar
Myth: GPUs don't do real floating point
NO: almost full IEEE single-precision FP compliance now (still limited under/overflow handling); double precision coming in next-gen architecture
GPU Floating Point Features

| Feature | G80 | SSE | IBM Altivec | Cell SPE |
| Precision | IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754 |
| Rounding modes for FADD and FMUL | Round to nearest and round to zero | All 4 IEEE: round to nearest, zero, inf, -inf | Round to nearest only | Round to zero/truncate only |
| Denormal handling | Flush to zero | Supported, 1000's of cycles | Supported, 1000's of cycles | Flush to zero |
| NaN support | Yes | Yes | Yes | No |
| Overflow and infinity support | Yes, only clamps to max norm | Yes | Yes | No, infinity |
| Flags | No | Yes | Yes | Some |
| Square root | Software only | Hardware | Software only | Software only |
| Division | Software only | Hardware | Software only | Software only |
| Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit | 12 bit |
| Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit | 12 bit |
| log2(x) and 2^x estimates accuracy | 23 bit | No | 12 bit | No |
© NVIDIA, 2007
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
Role of GPU in System Architecture
Historically, GPU as a discrete board
Large processor die, large dedicated memory
Likely to remain source of largest FLOPs
"Integrated" GPU now included in chipsets
Typically very small, ~1/16th capability of high-end GPU
No dedicated memory today
"Fused" GPU coming soon
Intel and AMD have both announced combination of CPU and GPU on the same die
GPU less capable than discrete GPU
But tight HW coupling allows tight SW coupling between CPU, GPU tasks
Hard architectural boundary between CPU and GPU could eventually be relaxed
Heterogeneous Architectures
Same choices are/will be available for other coprocessors
Media processors, network processors, FPGAs
All available as discrete boards, some included in chipsets
Growing support for "peer" organization
Coprocessor fits in CPU socket on an SMP motherboard
FPGAs often discussed in this context
Integration of other coprocessors with CPU cores on same die seems inevitable
Implications
Discrete: high offload cost, offload “chunk” must be large enough to amortize this overhead
Bringing coprocessor closer reduces time, power cost for offload
Joining coprocessor and CPU on the same die could allow very tight coupling
Coprocessor features exposed through ISA
Shared memory, coherent caching
More flexible coprocessor exception handling
Drawbacks of integration:
Integration limits size of coprocessor
e.g., can’t be competitive with highest-end GPU
Risk that these will be low-end, low-margin parts
Premium pricing will require major value from the tight coupling
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
CUDA: Programming GPU in C
Philosophy: provide minimal set of C extensions necessary to expose general-purpose massively-parallel capabilities

Declaration specifiers to indicate where things live:
__global__ void KernelFunc(...); // kernel function, runs on device
__device__ int GlobalVar;        // variable in device memory
__shared__ int SharedVar;        // variable in per-block shared memory

Extend function invocation syntax for parallel kernel launch:
KernelFunc<<<500, 128>>>(...);   // launch 500 blocks w/ 128 threads each

Special variables for thread identification in kernels:
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

Intrinsics that expose specific operations in kernel code:
__syncthreads();                 // barrier synchronization within kernel
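Putting these extensions together, a minimal sketch (my example, not from the slides): a kernel that reverses each 128-element chunk of an array using per-block shared memory and a barrier.

__global__ void reverse_each_block(int *d)
{
    __shared__ int s[128];                        // per-block shared memory
    int base = blockIdx.x * blockDim.x;
    s[threadIdx.x] = d[base + threadIdx.x];       // each thread stages one element
    __syncthreads();                              // wait until the whole block has written
    d[base + threadIdx.x] = s[blockDim.x - 1 - threadIdx.x];  // write back reversed
}

// launch: one 128-thread block per 128-element chunk (n assumed a multiple of 128)
// reverse_each_block<<<n/128, 128>>>(d_data);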
Some Design Goals
Scale to 100’s of cores, 10,000’s of parallel threads
Let programmers focus on parallel algorithms, not mechanics of a parallel programming language
Enable heterogeneous systems (i.e., CPU + discrete GPU)
CPU & GPU are separate devices with separate DRAMs
Does not prevent use with integrated or peer organizations
Key Parallel Abstractions in CUDA
Hierarchy of concurrent threads
Lightweight synchronization primitives
Shared memory model for cooperating threads
Hierarchy of Concurrency
Kernels composed of many parallel threads
All threads execute the same sequential program
But don't need to execute in lockstep
Threads are grouped into thread blocks
Threads in the same block can communicate and cooperate
Notion of thread blocks is important for scalability
Threads/blocks have unique IDs
[Diagram: Thread t; Block b = threads t0 t1 … tB]
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Device Code
Courtesy of M. Garland, NVIDIA
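Note the launch above assumes N is a multiple of 256. A guarded variant for arbitrary N (my addition, following the bounds check used in the sparse-matrix kernel later in the talk):

__global__ void vecAdd_guarded(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N) C[i] = A[i] + B[i];   // threads past the end of the arrays do nothing
}

// launch with ceiling division so every element is covered:
// vecAdd_guarded<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);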
What is a Thread?
Independent thread of execution
Has its own PC, variables (registers), processor state, etc.
No implication about how threads are scheduled
Threads need not execute in lockstep
No restrictions on branching
CUDA threads might be physical threads
As on NVIDIA GPUs
CUDA threads might be virtual threads
Might pick 1 block = 1 physical thread on multicore CPU [Stratton'08]
What is a Thread Block?
Thread block = virtualized multiprocessor
Allows problem decomposition according to application's parallelism
Can customize # thread blocks for each kernel launch
Thread block = a (data-)parallel task
All blocks in kernel have the same entry point
But may execute any code they want
Thread blocks of a kernel must be independent tasks
Program must be valid for any interleaving of block executions
Thread blocks execute to completion without pre-emption
Blocks Must Be Independent
Any possible interleaving of blocks should be valid
Presumed to run to completion without pre-emption
Can run in any order
Can run concurrently OR sequentially
Blocks may coordinate but not synchronize
Shared queue pointer: OK
Shared lock: BAD … can easily deadlock
Independence requirement gives scalability
Courtesy of M. Garland, NVIDIA
Synchronization of Blocks
Threads within a block may synchronize with barriers:
… Step 1 …
__syncthreads();
… Step 2 …
Blocks coordinate via atomic memory operations
e.g., increment shared queue pointer with atomicInc()
Implicit barrier between dependent kernels:
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
Courtesy of M. Garland, NVIDIA
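To make the shared-queue-pointer idiom concrete, a minimal sketch (my example; the slides only name the pattern): each thread claims a unique output slot with an atomic, which is safe across blocks precisely because it needs no inter-block barrier.

__global__ void claim_slots(int *queue, int *tail)
{
    // atomicAdd returns the old value, so every thread receives a distinct
    // slot regardless of how blocks are interleaved or ordered
    int slot = atomicAdd(tail, 1);
    queue[slot] = blockIdx.x * blockDim.x + threadIdx.x;  // e.g., record this thread's global ID
}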
Types of Parallelism
Thread parallelism
Each thread is an independent thread of execution
Data parallelism
Across threads in a block
Across blocks in a kernel
Task parallelism
Different blocks are independent
Independent kernels
Memory Model (1)
[Diagram: each Thread has its own Per-thread Local Memory; each Block has a Per-Block Shared Memory (PBSM)]
Courtesy of M. Garland, NVIDIA
Memory Model (2)
[Diagram: sequential kernels (Kernel 0, then Kernel 1, …) all access the same Per-device Global Memory]
Courtesy of M. Garland, NVIDIA
Memory Model (3)
[Diagram: Host memory and Device 0 / Device 1 memories; cudaMemcpy() moves data between them]
Courtesy of M. Garland, NVIDIA
CUDA: Host Semantics
Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()
Explicit memory copy for host ↔ device, device ↔ device
cudaMemcpy(), cudaMemcpy2D(), ...
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Courtesy of M. Garland, NVIDIA
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Courtesy of M. Garland, NVIDIA
Example: Host Code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
Courtesy of M. Garland, NVIDIA
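The slide stops at the kernel launch. A typical completion (my addition, using the cudaMemcpy()/cudaFree() calls listed on the host-semantics slide) copies the result back and releases device memory:

// copy result back to host, then free device memory
float *h_C = …;   // host buffer for the result (elided, as above)
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);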
Compiling CUDA for GPUs
[Diagram: a C/C++ CUDA Application is compiled by NVCC into CPU Code plus generic PTX Code; a PTX-to-Target Translator then specializes the PTX into target device code for each GPU]
Courtesy J. Nickolls, NVIDIA
Sparse Matrix-Vector Multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

void csrmul_serial(uint *Ap, uint *Aj, float *Av,
                   uint num_rows, float *x, float *y)
{
    for(uint row=0; row<num_rows; ++row) {
        uint row_begin = Ap[row];
        uint row_end = Ap[row+1];
        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}
Courtesy of M. Garland, NVIDIA
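multiply_row is left as a declaration on the slides. Presumably it is the dot product of one compressed sparse row with the dense vector x; a minimal sketch (my assumption), qualified __host__ __device__ so both the serial version and the kernels can call it:

__host__ __device__ float multiply_row(uint size, uint *Aj, float *Av, float *x)
{
    float sum = 0;
    for(uint col=0; col<size; ++col)
        sum += Av[col] * x[Aj[col]];   // nonzero value times matching x element
    return sum;
}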
Sparse Matrix-Vector Multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

__global__
void csrmul_kernel(uint *Ap, uint *Aj, float *Av,
                   uint num_rows, float *x, float *y)
{
    uint row = blockIdx.x*blockDim.x + threadIdx.x;

    if( row<num_rows ) {
        uint row_begin = Ap[row];
        uint row_end = Ap[row+1];
        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}

Courtesy of M. Garland, NVIDIA
Reducing Memory Bandwidth via Caching in Shared Memory
__global__ void csrmul_cached(… … … … … …)
{
    uint begin = blockIdx.x*blockDim.x, end = begin+blockDim.x;
    uint row = begin + threadIdx.x;

    __shared__ float cache[blocksize];               // array to cache rows
    if( row<num_rows ) cache[threadIdx.x] = x[row];  // fetch to cache
    __syncthreads();

    if( row<num_rows ) {
        uint row_begin = Ap[row], row_end = Ap[row+1];
        float sum = 0;

        for(uint col=row_begin; col<row_end; ++col) {
            uint j = Aj[col];

            // Fetch from cached rows when possible
            float x_j = (j>=begin && j<end) ? cache[j-begin] : x[j];

            sum += Av[col] * x_j;
        }

        y[row] = sum;
    }
}
Courtesy of M. Garland, NVIDIA
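The slide elides the parameter list of csrmul_cached; presumably it matches csrmul_kernel, with blocksize a compile-time constant. A hedged launch sketch (my addition, with device pointers set up as in the vecAdd host code):

const uint blocksize = 256;
uint nblocks = (num_rows + blocksize - 1) / blocksize;   // ceiling division covers all rows
csrmul_cached<<<nblocks, blocksize>>>(Ap, Aj, Av, num_rows, x, y);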
Basic Efficiency Rules
Develop algorithms with a data parallel mindset
Simple example – parallel summation now requires a reduction
Maximize locality of global memory accesses
This will improve memory bandwidth utilization and, depending on platform, local caching
Exploit per-block shared memory as scratchpad
Even on CPUs, this will improve locality
Similar to benefits of blocking
Expose enough parallelism
Need minimum of 1000s of threads
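For example, the parallel summation mentioned above: a minimal tree-reduction sketch in per-block shared memory (my example, assuming 256-thread, power-of-two blocks; the host or a follow-up kernel sums the per-block partials):

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float partial[256];                  // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // pad the tail with zeros
    __syncthreads();

    // halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];               // one partial sum per block
}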
Summary So Far
Three key generic abstractions:
1. hierarchy of parallel threads
2. corresponding levels of synchronization
3. corresponding memory spaces
Thread blocks promote scalable algorithms
Focus on parallelism, correctness, and scalability first
Then a few standard optimizations usually produce significant additional speedup
CUDA illustrates promising directions to pursue for other coprocessors and heterogeneous systems in general
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
Tesla Architecture
128 scalar PEs ("unified shaders")
Per-block shared memory (PBSM) allows communication among threads
[Block diagram repeated: Host → Input Assembler → Thread Execution Manager; clusters of Thread Processors, each with its own PBSM; load/store path to Global Memory. © NVIDIA, 2007]
Tesla C870
681 million transistors
470 mm² in 90 nm CMOS
128 thread processors
518 GFLOPS peak
1.35 GHz processor clock
1.5 GB DRAM
76 GB/s peak
800 MHz GDDR3 clock
384-pin DRAM interface
ATX form factor card
PCI Express x16
170 W max with DRAM
© NVIDIA, 2007
Streaming Multiprocessor (SM)
Processing elements
8 scalar thread processors (SP)
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32KB)
½ MB total register file space!
usual ops: float, int, branch, …
also transcendentals, atomics
Hardware multithreading
up to 8 blocks resident at once
up to 768 active threads in total
16KB on-chip memory (PBSM)
low-latency storage
shared among threads of a block
allows threads to cooperate
[Diagram: an SM = multithreaded instruction unit (MT IU) + SPs + Shared Memory, running threads t0 t1 … tB]
© NVIDIA, 2007
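A quick check of the register-file claim (my arithmetic, using the 16-SM Tesla configuration described earlier): 8192 registers × 4 bytes × 16 SMs = 512 KB = ½ MB total.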
Blocks Run on Multiprocessors
Kernel launched by host
[Diagram: the kernel's thread blocks are distributed across the device processor array – many SMs, each with an MT IU, SPs, and Shared Memory – all connected to Device Memory]
Courtesy D. Luebke, NVIDIA
Hardware Multithreading
Hardware (GPU) allocates resources to blocks
Blocks need: thread slots, registers, shared memory
Blocks don’t run until resources are available
Hardware (SM) schedules threads
Threads have their own registers
Any thread not waiting for something can run
Context switching is (basically) free – every cycle
Hardware relies on threads to hide latency
Parallelism is necessary for performance
Courtesy D. Luebke, NVIDIA
Tesla SIMT Thread Execution
Groups of 32 threads formed into warps
Always executing same instruction
Shared instruction fetch/dispatch
Some become inactive when code path diverges
Hardware automatically handles divergence
Warps are the primitive unit of scheduling
pick 1 of 24 warps for each instruction slot
SIMT execution is an implementation choice
Sharing control logic leaves space for more ALUs
Largely invisible to programmer
Must understand for performance, not correctness
Courtesy D. Luebke, NVIDIA
[Diagram: SM multithreaded instruction scheduler issuing, over time: warp 8 instruction 11 → warp 1 instruction 42 → warp 3 instruction 95 → warp 8 instruction 12 → … → warp 3 instruction 96]
Courtesy J. Nickolls, NVIDIA
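To illustrate divergence (my example, not from the slides): when threads of one warp take different paths, the hardware runs each path in turn with the other threads masked off, then reconverges. The result is correct, but the warp pays for both paths.

__global__ void divergent(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // even and odd threads of the same 32-thread warp take different paths,
    // so the warp executes both paths serially with inactive threads masked
    if (i % 2 == 0)
        data[i] *= 2;
    else
        data[i] += 1;
    // a warp whose threads all branch the same way pays for only one path
}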
Memory Architecture
Direct load/store access to device memory
Treated as the usual linear sequence of bytes (i.e., not pixels)
Texture & constant caches are read-only access paths
On-chip shared memory shared among threads of a block
Important for communication amongst threads
Provides low-latency temporary storage (~100x less than DRAM)
[Diagram: SM (MT IU, I-cache, SPs, Shared Memory) accesses Device Memory directly via load/store and read-only via the Texture Cache and Constant Cache; Host Memory connects over PCIe]
Courtesy D. Luebke, NVIDIA
Summary So Far
Key Tesla Architecture Features:
Scalar ISA
32-wide SIMT
Deeply multithreaded
Per-block shared memory
Designed for scalability
Conclusions
ILP wall + power wall → multicore
Power wall will limit multicore scaling too
Coprocessors offer compelling performance and energy-efficiency benefits
Architecture of a heterogeneous system is an open question
Programmability is the key challenge for heterogeneous architectures
CUDA offers interesting lessons on generic abstractions, scalability
GPUs are an interesting platform for research on parallelism, heterogeneity
Manycore architecture
Facilitates parallelism research at scale
Can be placed at various positions in system architecture
Thank You
Questions?
Contact me: [email protected]
http://www.cs.virginia.edu/~skadron