TRANSCRIPT
University of Virginia © Kevin Skadron, 2008
Massively Parallel Graphics Processors in a Multicore, Power-Limited Era

Kevin Skadron
University of Virginia Dept. of Computer Science, LAVA Lab
and NVIDIA Research
Outline of Overall Talk
Why multicore? How did we get into this jam?
What next? How do we get out of this jam?
Are heterogeneous architectures the answer?
What is the role of graphics processors (GPUs)?
Role in system architecture
Architecture and programming overview (2nd half of talk)
Disclaimer
The opinions here are my own as a computer engineer. They represent my interpretation of technology trends and associated opportunities. They do not in any way represent positions or plans of the University of Virginia or NVIDIA.
Why Multicore? How did we get here?
Combination of both "ILP wall" and "power wall"
ILP wall: wider superscalar & more aggressive out-of-order execution → diminishing returns
Boost single-thread performance → boost frequency
Power wall: boosting frequency to keep up with Moore's Law (2X per generation) is expensive
Natural frequency growth with technology scaling is only ~20-30% per generation
– Don't need expensive microarchitectures just for this
Faster frequency growth requires:
– Aggressive circuits (expensive)
– Very deep pipeline – 30+ stages? (expensive)
– Power-saving techniques weren't able to compensate
No longer worth the Si, cooling costs, or battery life
Single-core Watts/Spec
[Scatter plot: Watts/Spec (log scale, 0.001-1) vs. Spec2000 performance (log scale, 1-10000) for single-core processors through 2005, including Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64. Normalized to the same technology node. Courtesy Mark Horowitz.]
The Multi-core Revolution
Can't make a single core much faster
But need to maintain profit margins
More and more cache → diminishing returns
"New" Moore's Law: same core is 2X smaller per generation, can double # cores
Focus on throughput
Can use smaller, lower-power cores (even in-order issue)
Make cores multi-threaded
Trade single-thread performance for:
Better throughput
Lower power density
Maybe keep one aggressive core so we don't make single-thread performance worse
Can Parallelism Succeed?
Parallel computing never took off as a commodity
Expensive and rare despite decades of investments
What's different this time?
Need – power wall!
Opportunity – commodity hardware is out there now in hundreds of millions of PCs and servers
x86 multicores – 6-way multicore has been announced
GPUs – 100+ cores!
This breaks the chicken-and-egg problem
– Parallel language and software creators no longer need to wait for parallel hardware
What To Do With All Those Cores?
PC workloads have limited number of independent tasks
Parallel programming is hard, isn’t it?
Well, at least we solved the power wall…or did we?
Moore’s Law and Dennard Scaling
Moore's Law: transistor density doubles every N years (currently N ~ 2)
Dennard Scaling (constant electric field):
Shrink feature size by k (typ. 0.7), hold electric field constant
Area scales by k² (≈1/2); C, V, and delay each reduce by k
P ∝ CV²f → P goes down by k²
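Spelling out the constant-field arithmetic (a worked derivation added here for completeness; delay shrinking by k means frequency can rise by 1/k):

\[ C' = kC, \qquad V' = kV, \qquad f' = f/k \]
\[ P' = C'V'^2 f' = (kC)(kV)^2 (f/k) = k^2\, CV^2 f \]
\[ \frac{P'}{A'} = \frac{k^2 P}{k^2 A} = \frac{P}{A} \quad \text{(power density stays constant)} \]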
Moore’s Law and Dennard Scaling
Works well for “shrinks”
Doesn't apply to the high end
Generally keep area constant, use doubled transistor density to add more features, so total C doesn't scale
Exec. time = Insts × CPI × cycle time
Out-of-order execution, wide superscalar, aggressive speculation to boost instruction-level parallelism (improve CPI)
Aggressive pipelining, circuits, etc. to boost frequency (cycle time) beyond the "natural" rate
Leakage has been going up
Power and power density went up, not down
Actual Power
[Chart: Max Power (Watts), log scale 1-100, for Intel processors from i386 and i486 through Pentium, Pentium w/MMX tech., Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core 2 Duo. Source: Intel.]
Power Wall Redux
Vdd scaling is coming to a halt
Currently 0.9-1.0V, scaling only ~2.5%/gen [ITRS'06]
Even if we generously assume C scales and frequency is flat:
P ∝ CV²f → 0.7 × (0.975²) × 1 ≈ 0.66
Power density goes up: P/A = 0.66/0.5 = 1.33
And this is very optimistic, because C probably scales more like 0.8 or 0.9, and we want frequency to go up, so a more likely number is 1.5-1.75X
If we keep the %-area dedicated to all the cores the same, total power goes up by the same factor
But max TDP for air cooling is expected to stay flat
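The 1.5-1.75X figures follow from the same formula under the less generous assumptions the slide names (my reconstruction of the arithmetic, not from the source):

\[ \frac{P'/A'}{P/A} = \frac{C_{\mathrm{scale}} \cdot V_{\mathrm{scale}}^2 \cdot f_{\mathrm{scale}}}{A_{\mathrm{scale}}} = \frac{0.8 \times 0.975^2 \times 1}{0.5} \approx 1.52, \qquad \frac{0.9 \times 0.975^2 \times 1}{0.5} \approx 1.71 \]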
Thermal Considerations
When cooling is the main constraint:
Pick max Tj, typically 100-125C, based on reliability, leakage tolerance, and ergonomics
The most thermally efficient design maximizes TDP (and hopefully throughput) under this constraint
Hotspots hit Tj faster → lost opportunity
Seek thermally uniform macro-architectures
Multicore layout and "spatial filtering" give you an extra lever
The smaller a power dissipator, the more effectively it spreads its heat [IEEE Trans. Computers, to appear]
Ex: 2x2 grid vs. 21x21 grid: 188W TDP vs. 220 W (17%) – DAC 2008
• Increase core density
• Or raise Vdd, Vth, etc.
Thinner dies, better packaging boost this effect
Seek architectures that minimize area of high power density, maximize area in between, and can be easily partitioned
Where We are Today - Multicore
Classic architectures
Power wall
Programmability wall
http://interactive.usc.edu/classes/ctin542-designprod/archives/r2d2-01.jpg
Outline of Overall Talk
Why multicore? How did we get into this jam?
What next? How do we get out of this jam?
Are heterogeneous architectures the answer?
What is the role of graphics processors (GPUs)?
Role in system architecture
Architecture and programming overview
Low-Fat Cores
Claes Oldenburg, Apple Core – Autumn. http://www.greenwicharts.org/pastshows.asp
What Do We Do?
Make conventional cores more efficient
Lower-power flip-flops, lower-power clock tree
Simpler pipeline, simpler cache
But – this is running out of steam!
Use parallelism to boost efficiency
Simpler general-purpose cores, multi-threading
Asymmetric architectures
But – as # cores ↑, overhead ↑
Specialized storage/communication (e.g., scratchpad, streams)
But – this complicates programming
For all the above, the communication and memory hierarchy are designed for the lowest common denominator
Rethink the architecture:
SIMD
Specialized coprocessors (GPUs, media, crypto, FPGAs…)
Asymmetry
Asymmetric: cores with same ISA, different sizes/microarchitectures
Heterogeneous: different ISAs (e.g., Fusion, Cell BE)
1-2 aggressive ILP cores, then scale up # simple cores
Supports good single-thread performance and good scalable throughput
Dynamic cores: composing ILP cores from several simple cores avoids problems of fixed partitioning
What if 3 threads need high perf? Or 0?
But more design complexity
Federation (DAC 2008) – combines two simple, in-order cores to get one out-of-order core
General-Purpose Focus on Throughput
Simplifying cores saves area and power
Allows more processing elements (PEs) in a given area
Multithreading maximizes utilization of the PEs
Tolerates pipeline and memory latencies
Permits further simplification of cores
Can leave lots of state idle (full register context) while waiting on memory
Software-controlled data motion (via scratchpads or streams) is an alternative way to manage memory latency
Avoid unexpected cache misses
Stage data to overlap gather of the next "chunk" with computation using the current "chunk"
But – irregular data-access patterns and fine-grained read-write sharing are very hard to manage in software
Main drawback: general-purpose cores and memory hierarchy need to work for all programs → lowest common denominator
Specialize (1)
SIMD
+ Amortizes fetch, decode, control, and register-access logic
+ Tends to better preserve memory-access locality
+ Space savings allow more ALUs or on-chip memory in same area/total power
- Tends to have nasty crossbars
- Doesn't deal well with threads that can't stay in lockstep
  • Multiple cores of limited SIMD width
  • Work queues, conditional streams, etc. needed for reconvergence within a SIMD word
- How to support single-thread performance?
  • Processor for a single "thread" is typically pretty wimpy
- Densely packed ALUs
  • Can they be spread out?
Specialize (2)
Is heterogeneity the answer?
Specialized coprocessors trade generality for efficiency
Datapaths, memory hierarchies tuned for certain types of code
Graphics processors (GPUs)
Network processors (NPUs)
Media processors
10-100x speedups often possible
May still be high-power cores
But also high performance/watt
Main drawbacks:
Cooling can still be a challenge
Only suitable for certain types of algorithms
How do you choose which coprocessors to include?
Each has its own API; programming a collection of different coprocessors is a potential nightmare
Programming for Heterogeneity
Need either:
Very wide applicability (GPUs, media processors)
An architecture-specific API can still survive with enough market share
Flexible programming model that applies to multiple types of coprocessors
Flexible specification of parallelism
Ability to use hardware-accelerated functions when available
– Transcendentals, string matching, etc.
We are going to need new, higher-level programming models anyway
High-Level Programming Models
Claim: programmers who can do low-level parallel programming are an elite minority
We will never train the "average programmer" to write highly parallel programs in C, Java, X10, etc.
Most people need to think about things in pieces
And pieces need sequential semantics
But it's ok if the "pieces" are internally parallel
Threaded programming models don't easily support such decomposition
Must develop APIs and libraries with higher-level abstractions
Simplify the average programmer's task
Allow advanced programmers to drill down
We will need this regardless of what the underlying architecture is
But it also buys us more flexibility in the architecture
Hiding hardware details facilitates heterogeneity
Best APIs may be domain-specific
DirectX, OpenGL are a good case study
3D Rendering APIs
[Pipeline diagram: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display]
High-level abstractions for rendering geometry
Courtesy of D. Luebke, NVIDIA
3D Rendering APIs
High-level abstractions for rendering geometry
Serial ordering among primitives
Implicit synchronization
No guarantees about ordering within primitives → means no fine-grained synchronization
Middleware translates to CPUs and various GPUs (NVIDIA, ATI, Intel)
Domain-specific API is convenient for programmers and provides lots of semantic information to middleware: parallelism, load balancing, etc.
Domain-specific API is convenient for hardware designers: same API supports radically different architectures across product generations and companies
Similar arguments apply to Matlab, SQL, Map-Reduce, etc.
These examples show how abstractions might solve the programming challenges associated with both:
Highly parallel architectures
Heterogeneous architectures
Where Do GPUs Fit In?
Need scalable, programmable multicore
Scalable: doubling PEs ~doubles performance
GPUs have been doing this for years
Programmable: easy to realize perf. potential
GPUs provide a pool of cores with general-purpose instruction sets (plus graphics-specific extras)
DirectX, OpenGL allow apps to scale with the HW
CUDA leverages this background
[Chart: FLOPS, NVIDIA GPU vs. Intel CPU. © NVIDIA, 2007]
Why is CUDA Important? (1)
Mass-market host platform
Easy to buy and set up a system
Provides a solution for manycore parallelism
Not limited to small core counts
Easy-to-learn abstractions for massive parallelism
Abstractions not tied to a specific platform
Doesn't depend on graphics pipeline; can be implemented on other platforms
Preliminary results suggest that CUDA programs run efficiently on multicore CPUs [Stratton'08]
Supports a wide range of application characteristics
More general than streaming
Not limited to data parallelism
Why is CUDA Important? (2)
CUDA + GPUs facilitate multicore research at scale
NVIDIA Tesla: 16 8-way SIMD cores = 128 PEs, 12,288 thread contexts total
Simple programming model allows exploration of new algorithms and hardware bottlenecks
The whole community can learn from this
CUDA + GPUs provide a real platform…today
Results are not theoretical
Increases interest from potential users, e.g. computational scientists
Boosts opportunities for interdisciplinary collaboration
CUDA is teachable
Undergrads can start writing real programs within a couple of weeks
Terminology: What is "GPGPU"?
Definition 1: GPGPU = general-purpose computing with GPUs = any use of GPUs for non-rendering tasks
Definition 2: GPGPU = general-purpose computing with 3D APIs (i.e., DirectX and OpenGL)
3D APIs have processing overhead of the entire graphics pipeline
Limited interface to memory, no inter-thread communication
Often difficult to map application as rendering of polygon(s)
These restrictions are now indelibly tied to "GPGPU"
New wave of general-purpose computing avoids these restrictions → new term "GPU Computing"
Summary So Far
ILP wall + power wall → multicore
Power wall will limit multicore scaling too
Emphasizing throughput allows the individual cores to be simplified, reducing power
Thermal-aware design and placement can mitigate cooling limits
Eventually, all these techniques run out of steam
Heterogeneous architectures offer 10-100X performance, energy-efficiency benefits
Outline of Overall Talk
Why multicore? How did we get into this jam?
What next? How do we get out of this jam?
Are heterogeneous architectures the answer? Not clear yet…
What is the role of graphics processors (GPUs)?
Role in system architecture
Architecture and programming overview
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
Manycore GPU – Block Diagram
Tesla architecture, launched Nov. 2006
128 scalar PEs ("unified shaders")
Per-block shared memory (PBSM) allows communication among threads
[Block diagram: Host → Input Assembler → Thread Execution Manager dispatching work to clusters of Thread Processors, each cluster with its own PBSM; load/store path to Global Memory. © NVIDIA, 2007]
AMD/ATI Radeon HD 2900
320 PEs, but 16-way SIMD of 5-way VLIW
Still based on 4-vectors (x,y,z,w)
Source: Michael Doggett, AMD, “Radeon HD 2900”, keynote at Graphics Hardware
CUDA vs. GPUs
CUDA is a scalable parallel programming model and a software environment for parallel computing
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel programming model
Abstractions not GPU-specific
Also maps well to multicore CPUs! [Stratton’08]
AMD will use Brook+ – details not yet available, but presumably similar goals of scalability, portability, heavier focus on stream primitives
NVIDIA’s TESLA GPU architecture accelerates CUDA, DirectX, OpenGL
Tesla architecture is basis of GeForce, Quadro, and Tesla product lines
G80 = GeForce 8800 GTX
Heterogeneous Programming
CUDA = serial program with parallel kernels, all in C
Serial C code executes in a CPU thread
AMD CTM/CAL GPU interface is conceptually similar
Parallel kernel C code executes in thread blocks across multiple processing elements
Thread blocks are important for scalability
[Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU:
Serial Code → Parallel Kernel KernelA<<< nBlk, nTid >>>(args); → Serial Code → Parallel Kernel KernelB<<< nBlk, nTid >>>(args); → …]
Courtesy of M. Garland, NVIDIA
How do GPUs differ from CPUs?
Key: perf/mm²
Emphasize throughput, not per-thread latency
Maximize number of PEs and utilization
Many small PEs
Amortize hardware in time – multithreading
Hide latency with computation, not caching
Spend area on PEs instead
Hide latencies with fast thread switch and many threads/PE (24 on NVIDIA Tesla/G80!)
Exploit SIMD efficiency
Amortize hardware in space – share fetch/control among multiple PEs
8 in the case of Tesla
Note that SIMD ≠ vector
NVIDIA's architecture is "scalar SIMD" (SIMT), AMD does both
High bandwidth to global memory
Minimizes amount of multithreading needed
Tesla memory interface is 384-bit, AMD Radeon 2900 is 512-bit
Net result: 470 GFLOP/s and ~80 GB/s sustained in GeForce 8800GTX
How do GPUs differ from CPUs? (2)
Hardware thread creation and management
New thread for each vertex/pixel
CPU: kernel or user-level software involvement
Virtualized cores
Program is agnostic about physical number of cores
True for both 3D and general-purpose
CPU: number of threads generally f(# cores)
Hardware barriers
These characteristics simplify problem decomposition, scalability, and portability
Nothing prevents non-graphics hardware from adopting these features
How do GPUs differ from CPUs? (3)
Specialized graphics hardware exposed through CUDA
Texture path
High-bandwidth gather, interpolation
Constant memory
Even higher-bandwidth access to small read-only data regions
Transcendentals (reciprocal sqrt, trig, log2, etc.)
Different implementation of atomic memory operations
GPU: handled in memory interface
CPU: generally handled with CPU involvement
Local scratchpad in each core (a.k.a. per-block shared memory)
Memory system exploits spatial, not temporal locality
How do GPUs differ from CPUs? (4)
Fundamental trends are actually very general: exploit parallelism in time and space
Other processor families are following similar paths (multithreading, SIMD, etc.):
Radeon, Niagara, Larrabee, network/content processors, Clearspeed, Cell BE, many others…
Heterogeneous: Cell BE, Fusion, Tolapai
Myths of GPU Computing
Myth: GPUs layer normal programs on top of graphics
NO: CUDA compiles directly to the hardware
Myth: GPU architectures are very wide (1000s-way) SIMD machines
NO: NVIDIA Tesla is 32-wide
Myth: branching is impossible or prohibitive
NO: flexible branching and efficient management of SIMD divergence
Myth: GPUs compute with 4-wide vector registers
Still true for AMD Radeon, but NO: NVIDIA Tesla is scalar
Myth: GPUs don't do real floating point
NO: almost full IEEE single-precision FP compliance now (still limited under/overflow handling); double precision coming in next-gen architecture
GPU Floating Point Features

| Feature | G80 | SSE | IBM Altivec | Cell SPE |
| Precision | IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754 |
| Rounding modes for FADD and FMUL | Round to nearest and round to zero | All 4 IEEE: round to nearest, zero, inf, -inf | Round to nearest only | Round to zero/truncate only |
| Denormal handling | Flush to zero | Supported, 1000's of cycles | Supported, 1000's of cycles | Flush to zero |
| NaN support | Yes | Yes | Yes | No |
| Overflow and infinity support | Yes, only clamps to max norm | Yes | Yes | No, infinity |
| Flags | No | Yes | Yes | Some |
| Square root | Software only | Hardware | Software only | Software only |
| Division | Software only | Hardware | Software only | Software only |
| Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit | 12 bit |
| Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit | 12 bit |
| log2(x) and 2^x estimates accuracy | 23 bit | No | 12 bit | No |
© NVIDIA, 2007
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
Role of GPU in System Architecture
Historically, GPU as a discrete board
Large processor die, large dedicated memory
Likely to remain source of largest FLOPs
"Integrated" GPU now included in chipsets
Typically very small, ~1/16th capability of high-end GPU
No dedicated memory today
"Fused" GPU coming soon
Intel and AMD have both announced combination of CPU and GPU on the same die
GPU less capable than discrete GPU
But tight HW coupling allows tight SW coupling between CPU, GPU tasks
Hard architectural boundary between CPU and GPU could eventually be relaxed
Heterogeneous Architectures
Same choices are/will be available for other coprocessors
Media processors, network processors, FPGAs
All available as discrete boards, some included in chipsets
Growing support for "peer" organization
Coprocessor fits in CPU socket on an SMP motherboard
FPGAs often discussed in this context
Integration of other coprocessors with CPU cores on same die seems inevitable
Implications
Discrete: high offload cost, offload “chunk” must be large enough to amortize this overhead
Bringing coprocessor closer reduces time, power cost for offload
Joining coprocessor and CPU on the same die could allow very tight coupling
Coprocessor features exposed through ISA
Shared memory, coherent caching
More flexible coprocessor exception handling
Drawbacks of integration:
Integration limits size of coprocessor
e.g., can’t be competitive with highest-end GPU
Risk that these will be low-end, low-margin parts
Premium pricing will require major value from the tight coupling
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
CUDA: Programming GPU in C
Philosophy: provide minimal set of C extensions necessary to expose general-purpose massively-parallel capabilities

Declaration specifiers to indicate where things live:
__global__ void KernelFunc(...); // kernel function, runs on device
__device__ int GlobalVar;        // variable in device memory
__shared__ int SharedVar;        // variable in per-block shared memory

Extend function invocation syntax for parallel kernel launch:
KernelFunc<<<500, 128>>>(...);   // launch 500 blocks w/ 128 threads each

Special variables for thread identification in kernels:
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

Intrinsics that expose specific operations in kernel code:
__syncthreads();                 // barrier synchronization within kernel
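Putting these extensions together, a minimal sketch (my example, not from the slides): a kernel that reverses each 128-element chunk of an array using per-block shared memory and a barrier.

__global__ void reverse_each_block(int *d)
{
    __shared__ int s[128];                        // per-block shared memory
    int base = blockIdx.x * blockDim.x;
    s[threadIdx.x] = d[base + threadIdx.x];       // each thread stages one element
    __syncthreads();                              // wait until the whole block has written
    d[base + threadIdx.x] = s[blockDim.x - 1 - threadIdx.x];  // write back reversed
}

// launch: one 128-thread block per 128-element chunk (n assumed a multiple of 128)
// reverse_each_block<<<n/128, 128>>>(d_data);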
Some Design Goals
Scale to 100’s of cores, 10,000’s of parallel threads
Let programmers focus on parallel algorithms, not mechanics of a parallel programming language
Enable heterogeneous systems (i.e., CPU + discrete GPU)
CPU & GPU are separate devices with separate DRAMs
Does not prevent use with integrated or peer organizations
Key Parallel Abstractions in CUDA
Hierarchy of concurrent threads
Lightweight synchronization primitives
Shared memory model for cooperating threads
Hierarchy of Concurrency
Kernels composed of many parallel threads
All threads execute the same sequential program
But don't need to execute in lockstep
Threads are grouped into thread blocks
Threads in the same block can communicate and cooperate
Notion of thread blocks is important for scalability
Threads/blocks have unique IDs
[Diagram: Thread t; Block b = threads t0 t1 … tB]
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Device Code
Courtesy of M. Garland, NVIDIA
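Note the launch above assumes N is a multiple of 256. A guarded variant for arbitrary N (my addition, following the bounds check used in the sparse-matrix kernel later in the talk):

__global__ void vecAdd_guarded(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N) C[i] = A[i] + B[i];   // threads past the end of the arrays do nothing
}

// launch with ceiling division so every element is covered:
// vecAdd_guarded<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);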
What is a Thread?
Independent thread of execution
Has its own PC, variables (registers), processor state, etc.
No implication about how threads are scheduled
Threads need not execute in lockstep
No restrictions on branching
CUDA threads might be physical threads
As on NVIDIA GPUs
CUDA threads might be virtual threads
Might pick 1 block = 1 physical thread on multicore CPU [Stratton'08]
What is a Thread Block?
Thread block = virtualized multiprocessor
Allows problem decomposition according to application's parallelism
Can customize # thread blocks for each kernel launch
Thread block = a (data-)parallel task
All blocks in kernel have the same entry point
But may execute any code they want
Thread blocks of a kernel must be independent tasks
Program must be valid for any interleaving of block executions
Thread blocks execute to completion without pre-emption
Blocks Must Be Independent
Any possible interleaving of blocks should be valid
Presumed to run to completion without pre-emption
Can run in any order
Can run concurrently OR sequentially
Blocks may coordinate but not synchronize
Shared queue pointer: OK
Shared lock: BAD … can easily deadlock
Independence requirement gives scalability
Courtesy of M. Garland, NVIDIA
Synchronization of Blocks
Threads within a block may synchronize with barriers:
… Step 1 …
__syncthreads();
… Step 2 …
Blocks coordinate via atomic memory operations
e.g., increment shared queue pointer with atomicInc()
Implicit barrier between dependent kernels:
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
Courtesy of M. Garland, NVIDIA
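To make the shared-queue-pointer idiom concrete, a minimal sketch (my example; the slides only name the pattern): each thread claims a unique output slot with an atomic, which is safe across blocks precisely because it needs no inter-block barrier.

__global__ void claim_slots(int *queue, int *tail)
{
    // atomicAdd returns the old value, so every thread receives a distinct
    // slot regardless of how blocks are interleaved or ordered
    int slot = atomicAdd(tail, 1);
    queue[slot] = blockIdx.x * blockDim.x + threadIdx.x;  // e.g., record this thread's global ID
}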
Types of Parallelism
Thread parallelism
Each thread is an independent thread of execution
Data parallelism
Across threads in a block
Across blocks in a kernel
Task parallelism
Different blocks are independent
Independent kernels
Memory Model (1)
[Diagram: each Thread has its own Per-thread Local Memory; each Block has a Per-Block Shared Memory (PBSM)]
Courtesy of M. Garland, NVIDIA
Memory Model (2)
[Diagram: sequential kernels (Kernel 0, then Kernel 1, …) all access the same Per-device Global Memory]
Courtesy of M. Garland, NVIDIA
Memory Model (3)
[Diagram: Host memory and Device 0 / Device 1 memories; cudaMemcpy() moves data between them]
Courtesy of M. Garland, NVIDIA
CUDA: Host Semantics
Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()
Explicit memory copy for host ↔ device, device ↔ device
cudaMemcpy(), cudaMemcpy2D(), ...
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Courtesy of M. Garland, NVIDIA
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Courtesy of M. Garland, NVIDIA
Example: Host Code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
Courtesy of M. Garland, NVIDIA
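The slide stops at the kernel launch. A typical completion (my addition, using the cudaMemcpy()/cudaFree() calls listed on the host-semantics slide) copies the result back and releases device memory:

// copy result back to host, then free device memory
float *h_C = …;   // host buffer for the result (elided, as above)
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);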
Compiling CUDA for GPUs
[Diagram: a C/C++ CUDA Application is compiled by NVCC into CPU Code plus generic PTX Code; a PTX-to-Target Translator then specializes the PTX into target device code for each GPU]
Courtesy J. Nickolls, NVIDIA
Sparse Matrix-Vector Multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

void csrmul_serial(uint *Ap, uint *Aj, float *Av,
                   uint num_rows, float *x, float *y)
{
    for(uint row=0; row<num_rows; ++row) {
        uint row_begin = Ap[row];
        uint row_end = Ap[row+1];
        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}
Courtesy of M. Garland, NVIDIA
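multiply_row is left as a declaration on the slides. Presumably it is the dot product of one compressed sparse row with the dense vector x; a minimal sketch (my assumption), qualified __host__ __device__ so both the serial version and the kernels can call it:

__host__ __device__ float multiply_row(uint size, uint *Aj, float *Av, float *x)
{
    float sum = 0;
    for(uint col=0; col<size; ++col)
        sum += Av[col] * x[Aj[col]];   // nonzero value times matching x element
    return sum;
}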
Sparse Matrix-Vector Multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

__global__
void csrmul_kernel(uint *Ap, uint *Aj, float *Av,
                   uint num_rows, float *x, float *y)
{
    uint row = blockIdx.x*blockDim.x + threadIdx.x;

    if( row<num_rows ) {
        uint row_begin = Ap[row];
        uint row_end = Ap[row+1];
        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}

Courtesy of M. Garland, NVIDIA
Reducing Memory Bandwidth via Caching in Shared Memory
__global__ void csrmul_cached(… … … … … …)
{
    uint begin = blockIdx.x*blockDim.x, end = begin+blockDim.x;
    uint row = begin + threadIdx.x;

    __shared__ float cache[blocksize];               // array to cache rows
    if( row<num_rows ) cache[threadIdx.x] = x[row];  // fetch to cache
    __syncthreads();

    if( row<num_rows ) {
        uint row_begin = Ap[row], row_end = Ap[row+1];
        float sum = 0;

        for(uint col=row_begin; col<row_end; ++col) {
            uint j = Aj[col];

            // Fetch from cached rows when possible
            float x_j = (j>=begin && j<end) ? cache[j-begin] : x[j];

            sum += Av[col] * x_j;
        }

        y[row] = sum;
    }
}
Courtesy of M. Garland, NVIDIA
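The slide elides the parameter list of csrmul_cached; presumably it matches csrmul_kernel, with blocksize a compile-time constant. A hedged launch sketch (my addition, with device pointers set up as in the vecAdd host code):

const uint blocksize = 256;
uint nblocks = (num_rows + blocksize - 1) / blocksize;   // ceiling division covers all rows
csrmul_cached<<<nblocks, blocksize>>>(Ap, Aj, Av, num_rows, x, y);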
Basic Efficiency Rules
Develop algorithms with a data parallel mindset
Simple example – parallel summation now requires a reduction
Maximize locality of global memory accesses
This will improve memory bandwidth utilization and, depending on platform, local caching
Exploit per-block shared memory as scratchpad
Even on CPUs, this will improve locality
Similar to benefits of blocking
Expose enough parallelism
Need minimum of 1000s of threads
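For example, the parallel summation mentioned above: a minimal tree-reduction sketch in per-block shared memory (my example, assuming 256-thread, power-of-two blocks; the host or a follow-up kernel sums the per-block partials):

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float partial[256];                  // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // pad the tail with zeros
    __syncthreads();

    // halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];               // one partial sum per block
}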
Summary So Far
Three key generic abstractions:
1. hierarchy of parallel threads
2. corresponding levels of synchronization
3. corresponding memory spaces
Thread blocks promote scalable algorithms
Focus on parallelism, correctness, and scalability first
Then a few standard optimizations usually produce significant additional speedup
CUDA illustrates promising directions to pursue for other coprocessors and heterogeneous systems in general
Outline of GPU Portion of Talk
Overview of interesting GPU features
Role of GPU in system architecture
More detail on CUDA
More detail on NVIDIA Tesla architecture
Tesla Architecture
128 scalar PEs ("unified shaders")
Per-block shared memory (PBSM) allows communication among threads
[Block diagram repeated: Host → Input Assembler → Thread Execution Manager; clusters of Thread Processors, each with its own PBSM; load/store path to Global Memory. © NVIDIA, 2007]
Tesla C870
681 million transistors
470 mm² in 90 nm CMOS
128 thread processors
518 GFLOPS peak
1.35 GHz processor clock
1.5 GB DRAM
76 GB/s peak
800 MHz GDDR3 clock
384-pin DRAM interface
ATX form factor card
PCI Express x16
170 W max with DRAM
© NVIDIA, 2007
Streaming Multiprocessor (SM)
Processing elements
8 scalar thread processors (SP)
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32KB)
½ MB total register file space!
usual ops: float, int, branch, …
also transcendentals, atomics
Hardware multithreading
up to 8 blocks resident at once
up to 768 active threads in total
16KB on-chip memory (PBSM)
low-latency storage
shared among threads of a block
allows threads to cooperate
[Diagram: an SM = multithreaded instruction unit (MT IU) + SPs + Shared Memory, running threads t0 t1 … tB]
© NVIDIA, 2007
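A quick check of the register-file claim (my arithmetic, using the 16-SM Tesla configuration described earlier): 8192 registers × 4 bytes × 16 SMs = 512 KB = ½ MB total.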
Blocks Run on Multiprocessors
Kernel launched by host
[Diagram: the kernel's thread blocks are distributed across the device processor array – many SMs, each with an MT IU, SPs, and Shared Memory – all connected to Device Memory]
Courtesy D. Luebke, NVIDIA
Hardware Multithreading
Hardware (GPU) allocates resources to blocks
Blocks need: thread slots, registers, shared memory
Blocks don’t run until resources are available
Hardware (SM) schedules threads
Threads have their own registers
Any thread not waiting for something can run
Context switching is (basically) free – every cycle
Hardware relies on threads to hide latency
Parallelism is necessary for performance
Courtesy D. Luebke, NVIDIA
Tesla SIMT Thread Execution
Groups of 32 threads formed into warps
Always executing same instruction
Shared instruction fetch/dispatch
Some become inactive when code path diverges
Hardware automatically handles divergence
Warps are the primitive unit of scheduling
pick 1 of 24 warps for each instruction slot
SIMT execution is an implementation choice
Sharing control logic leaves space for more ALUs
Largely invisible to programmer
Must understand for performance, not correctness
Courtesy D. Luebke, NVIDIA
[Diagram: SM multithreaded instruction scheduler issuing, over time: warp 8 instruction 11 → warp 1 instruction 42 → warp 3 instruction 95 → warp 8 instruction 12 → … → warp 3 instruction 96]
Courtesy J. Nickolls, NVIDIA
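To illustrate divergence (my example, not from the slides): when threads of one warp take different paths, the hardware runs each path in turn with the other threads masked off, then reconverges. The result is correct, but the warp pays for both paths.

__global__ void divergent(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // even and odd threads of the same 32-thread warp take different paths,
    // so the warp executes both paths serially with inactive threads masked
    if (i % 2 == 0)
        data[i] *= 2;
    else
        data[i] += 1;
    // a warp whose threads all branch the same way pays for only one path
}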
Memory Architecture
Direct load/store access to device memory
Treated as the usual linear sequence of bytes (i.e., not pixels)
Texture & constant caches are read-only access paths
On-chip shared memory shared among threads of a block
Important for communication amongst threads
Provides low-latency temporary storage (~100x less than DRAM)
[Diagram: SM (MT IU, I-cache, SPs, Shared Memory) accesses Device Memory directly via load/store and read-only via the Texture Cache and Constant Cache; Host Memory connects over PCIe]
Courtesy D. Luebke, NVIDIA
Summary So Far
Key Tesla Architecture Features:
Scalar ISA
32-wide SIMT
Deeply multithreaded
Per-block shared memory
Designed for scalability
Conclusions
ILP wall + power wall → multicore
Power wall will limit multicore scaling too
Coprocessors offer compelling performance and energy-efficiency benefits
Architecture of a heterogeneous system is an open question
Programmability is the key challenge for heterogeneous architectures
CUDA offers interesting lessons on generic abstractions, scalability
GPUs are an interesting platform for research on parallelism, heterogeneity
Manycore architecture
Facilitates parallelism research at scale
Can be placed at various positions in system architecture
Thank You
Questions?
Contact me: [email protected]
http://www.cs.virginia.edu/~skadron