
Page 1: Programming Massively Parallel Processors
Lecture Slides for Chapter 1: Introduction

© David Kirk/NVIDIA and Wen-mei W. Hwu

Page 2: Two Main Trajectories

• Since 2003, the semiconductor industry has followed two main trajectories:
– Multicore: seeks to maintain the execution speed of sequential programs by reducing latency.
– Many-core: seeks to improve the execution throughput of parallel applications. Each heavily multithreaded core is much smaller, and some cores share control logic and instruction cache.


Page 3: CPUs and GPUs have fundamentally different design philosophies


[Figure: side-by-side CPU and GPU chip layouts, each backed by DRAM. The CPU devotes most of its die area to control logic, a large cache, and a few powerful ALUs; the GPU devotes most of its area to a large array of small ALUs with minimal control logic and cache.]

Page 4: Multicore CPU

• Optimized for sequential programs: sophisticated control logic allows instructions from a single thread to execute faster. To minimize latency, large on-chip caches turn long-latency memory accesses into cache accesses, reducing the execution latency of each thread. However, the large cache memory (multiple megabytes), the low-latency arithmetic units, and the sophisticated operand delivery logic all consume chip area and power.
– Latency-oriented design.


Page 5: Multicore CPU

• Many applications are limited by the speed at which data can be moved from memory to the processor.
– The CPU must satisfy requirements from legacy operating systems and I/O devices, which makes it more difficult to increase memory bandwidth; CPU memory bandwidth is usually about 1/6 that of a GPU.


Page 6: Many-core GPU

• Shaped by the fast-growing video game industry, which expects a massive number of floating-point calculations per video frame.
• This is the motive to maximize the chip area and power budget dedicated to floating-point calculations. The solution is to optimize for the execution throughput of a massive number of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency; the reduced area and power spent on memory and arithmetic allow designers to put more cores on the chip and increase execution throughput.


Page 7: Many-core GPU

• A large number of threads lets the hardware find work to do while some threads are waiting for long-latency memory accesses or arithmetic operations. Small caches are provided to help control bandwidth requirements, so multiple threads that access the same memory do not all need to go to DRAM.
– Throughput-oriented design: strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.


Page 8: CPU + GPU

• GPUs will not perform well on tasks for which CPUs are designed to perform well. For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs.
• When a program has a large number of threads, GPUs with higher execution throughput can achieve much higher performance than CPUs. Many applications therefore use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPU, as in the sketch below.
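To make the division of labor concrete, here is a minimal CUDA sketch of that pattern (not from the slides; the kernel name vecAdd, the array size, and the launch configuration are illustrative): the CPU runs the sequential setup and the data transfers, while the numerically intensive, data-parallel loop runs as a GPU kernel across many threads.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Data-parallel part: each GPU thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Sequential part on the CPU: allocate and initialize the inputs.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Copy the inputs into GPU global memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough threads to cover all n elements.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy the result back and continue with sequential CPU code.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", h_c[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Error checking is omitted for brevity; real code would check the return value of each CUDA call.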


Page 9: GPU Adoption

• The processors of choice must have a very large presence in the marketplace.
– 400 million CUDA-enabled GPUs in use to date.
• Practical form factors and easy accessibility.
– Until 2006, parallel programs were run in data centers or on clusters. Actual clinical applications on MRI machines are based on a PC and special hardware accelerators; GE and Siemens cannot sell racks into clinical settings. NIH refused to fund parallel programming projects; today, NIH funds research that uses GPUs.


Page 10: Why Massively Parallel Processors


• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics APIs
• GPU in every PC and workstation: massive volume and potential impact

Page 11: Architecture of a CUDA-capable GPU


[Figure: block diagram of a CUDA-capable GPU. The host feeds an input assembler and a thread execution manager; an array of streaming multiprocessors, organized with parallel data caches and texture units, connects through load/store units to off-chip global memory.]

• Two streaming multiprocessors form a building block; each has a number of streaming processors that share control logic and instruction cache.
• Each GPU comes with multiple gigabytes of DRAM (global memory). It offers high off-chip bandwidth, though with longer latency than typical system memory; the high bandwidth makes up for the longer latency in massively parallel applications (see the device-query sketch below).
• G80: 86.4 GB/s of memory bandwidth, plus 8 GB/s of communication bandwidth with the CPU (4 GB/s in each direction).
• A good application runs 5,000 to 12,000 threads; CPUs support 2 to 8 threads.
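The properties mentioned above (number of streaming multiprocessors, amount of global memory, memory bandwidth) can be read at run time through the CUDA runtime API. Below is a minimal sketch querying device 0; the bandwidth figure is a rough peak estimate computed from the reported memory clock and bus width, not a measured number.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of GPU device 0

    // Streaming multiprocessor count and off-chip global memory (DRAM) size.
    printf("Device: %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Global memory: %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));

    // Rough peak bandwidth: 2 transfers/clock (DDR) * memory clock * bus width.
    double gbps = 2.0 * prop.memoryClockRate * 1e3      // kHz -> Hz
                * (prop.memoryBusWidth / 8.0)           // bits -> bytes
                / 1e9;                                   // bytes/s -> GB/s
    printf("Approximate peak memory bandwidth: %.1f GB/s\n", gbps);
    return 0;
}
```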

Page 12: GT200 Characteristics


• 1 TFLOPS peak performance (25 to 50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for applications such as VMD
• Massively parallel: 128 cores, 90 W
• Massively threaded: sustains thousands of threads per application
• 30 to 100 times speedup over high-end microprocessors on scientific and media applications such as medical imaging and molecular dynamics

“I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers.”

– John Stone, VMD group, Physics, UIUC

Page 13: Future Apps Reflect a Concurrent World


• Exciting applications in the future mass-computing market have traditionally been considered “supercomputing applications”:
– Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
– These “super-apps” represent and model the physical, concurrent world
• Various granularities of parallelism exist, but…
– the programming model must not hinder parallel implementation
– data delivery needs careful management

Page 14: Stretching Traditional Architectures


• Traditional parallel architectures cover some super-applications
– DSP, GPU, network apps, scientific computing
• The game is to grow mainstream architectures “out” or domain-specific architectures “in”
– CUDA is the latter

[Figure: coverage diagram. Traditional applications fall within current architecture coverage; new applications fall partly within domain-specific architecture coverage, with obstacles separating the two regions.]

Page 15: Software Evolution

• MPI: scales up to 100,000 nodes.
• CUDA: shared memory for parallel execution. Programmers manage the data transfer between the CPU and the GPU and the detailed parallel code constructs.
• OpenMP: shared memory. Not able to scale beyond a couple of hundred cores due to thread-management overhead and cache coherence. Compilers do most of the automation in managing parallel execution.
• OpenCL (2009): Apple, Intel, AMD/ATI, and NVIDIA proposed a standard programming model. It defines language extensions and a run-time API. An application developed in OpenCL can run without code modification on any processor that supports the OpenCL language extensions and API.
• OpenACC (2011): compiler directives mark specific loops and regions of code to offload from the CPU to the GPU; it is more similar to OpenMP (see the directive sketch below).
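For contrast with the explicit CUDA style shown earlier, here is a hedged sketch of the directive-based approach described in the OpenMP and OpenACC bullets. The function and variable names (saxpy_omp, saxpy_acc) are illustrative, not from the slides: the OpenMP pragma parallelizes the loop across CPU threads over shared memory, while the OpenACC pragma asks an OpenACC-capable compiler to offload the loop to the GPU and to generate the host-device data movement that CUDA code spells out by hand.

```c
#include <stddef.h>

/* OpenMP: the directive asks the compiler to run the loop iterations
   on multiple CPU threads that share memory; thread management is automated. */
void saxpy_omp(int n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* OpenACC: the directive marks the loop for offload to the GPU; the
   copyin/copy clauses describe the host-to-device data movement. */
void saxpy_acc(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```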
