Introduction to Accelerators and GPGPU Dan Ernst Cray, Inc.

Post on 25-Feb-2016

Page 1: Introduction to Accelerators  and GPGPU Dan Ernst Cray, Inc

Introduction to Accelerators and GPGPU

Dan Ernst

Cray, Inc.

Page 2

Conventional Wisdom (CW) in Computer Architecture

Old CW: Transistors expensive

New CW: “Power wall”: power expensive, transistors free (can put more on chip than can afford to turn on)

Old CW: Multiplies are slow, memory access is fast

New CW: “Memory wall”: memory slow, multiplies fast (200-600 clocks to DRAM memory, 4 clocks for FP multiply)

Old CW: Increasing Instruction Level Parallelism (ILP) via compilers and innovation (out-of-order, speculation, VLIW, …)

New CW: “ILP wall”: diminishing returns on more ILP

New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall

Old CW: Uniprocessor performance 2X / 1.5 yrs

New CW: Uniprocessor performance only 2X / 5 yrs?

Credit: D. Patterson, UC-Berkeley

Page 3

The Ox Analogy

"If one ox could not do the job, they did not try to grow a bigger ox, but used two oxen." - Admiral Grace Murray Hopper

It turns out that sacrificing uniprocessor performance for power savings can save you a lot.

Example:

Scenario One: one-core processor with power budget W. Increase frequency/ILP by 20%. This substantially increases power, by more than 50%, but only increases performance by 13%.

Scenario Two: decrease frequency by 20% with a simpler core. This decreases power by 50%, so you can now add another core (one more ox!).

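The percentages in the two scenarios are close to what a simple dynamic-power model predicts. The sketch below assumes (this is not stated on the slide) that dynamic power scales as C·V²·f and that supply voltage scales linearly with frequency, so power goes as the cube of the frequency scale factor. The function names are hypothetical and the throughput estimate assumes perfectly parallel work.

```cpp
#include <cassert>

// Relative dynamic power for a core whose clock is scaled by `freq_scale`.
// Assumption (not from the slide): P ~ C * V^2 * f with V proportional to f,
// so power scales with the cube of the frequency.
double relative_power(double freq_scale) {
    assert(freq_scale > 0.0);
    return freq_scale * freq_scale * freq_scale;
}

// Relative throughput of `cores` identical cores running at `freq_scale`,
// assuming (optimistically) that the work is perfectly parallel.
double relative_throughput(int cores, double freq_scale) {
    assert(cores > 0 && freq_scale > 0.0);
    return cores * freq_scale;
}
```

Under these assumptions, `relative_power(1.2)` is about 1.73 (more than 50% extra power) and `relative_power(0.8)` is about 0.51 (roughly half). Two cores at 80% frequency therefore fit in approximately the original budget W while offering up to 1.6x the throughput on work that parallelizes well.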
Page 4

The Ox Analogy Extended

"If one ox could not do the job, they did not try to grow a bigger ox, but used two oxen." - Admiral Grace Murray Hopper

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray

Chickens are gaining momentum nowadays: for certain classes of applications (not including field plowing...), you can run many cores at lower frequency and come out ahead (big time) at the speed game

Molecular dynamics codes (VMD, NAMD, etc.) have reported speedups of 25x – 100x!!

Page 5

Ox vs. Chickens

Oxen are good at plowing

Chickens pick up feed

Which do I use if I want to catch mice? I’d much rather have a couple cats

Moral: Finding the most appropriate tool for the job brings about savings in efficiency

Addendum: That tool will only exist and be affordable if someone can make money on it.

Page 6

Example of Efficiency

Cray High Density Custom Compute System

“Same” performance on Cray’s 2-cabinet custom solution compared to a 200-cabinet x86 off-the-shelf system

Engineered to achieve application performance at < 1/100 the space, weight and power cost of an off-the-shelf system

Cray designed, developed, integrated and deployed

System Characteristics | Cray Custom Solution | Off-the-Shelf System
Cabinets               | 2                    | 200
Sockets                | 48                   | 37,376
Core Count             | 96                   | 149,504
FPGAs                  | 88                   | 0
Total Power            | 42.7 kW              | 8,780 kW
Peak Flops             | 499 GFLOPS           | 1.2 PFLOPS
Total Floor Space      | 8.87 sq ft           | 4,752 sq ft
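The "< 1/100" claim can be checked against the table itself. The following is a minimal sketch that uses only the cabinet, power, and floor-space figures listed above; the struct, constants, and function name are hypothetical.

```cpp
#include <cassert>

// Selected system characteristics from the table (names are illustrative).
struct SystemSpec {
    double cabinets;
    double power_kw;
    double floor_sq_ft;
};

// How many times larger quantity `b` is than quantity `a`.
double ratio(double a, double b) {
    assert(a > 0.0);
    return b / a;
}

const SystemSpec kCustom      = {2.0, 42.7, 8.87};
const SystemSpec kOffTheShelf = {200.0, 8780.0, 4752.0};
```

With these values, `ratio(kCustom.power_kw, kOffTheShelf.power_kw)` is roughly 206 and the floor-space ratio is roughly 536, while the cabinet ratio is exactly 100. Weight does not appear in the table, so it is not covered by this check.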

Page 7

Intel P4 Northwood

Page 8

NVIDIA GT200

Page 9

The Energy-Flexibility Gap

[Figure: energy efficiency (log scale, 0.1 to 1000) plotted against flexibility (coverage). Labeled regions: Dedicated HW / ASIC, Reconfigurable Processor/Logic, ASPs, DSPs, Embedded Processors. Annotations: "GPUs were here 7-10 years ago" and "Now, they’re in this space".]

Page 10

GPGPU

General Purpose computing on Graphics Processing Units

Previous GPGPU constraint: to get general purpose code working, you had to use the corner cases of the graphics API

Essentially: re-write the entire program as a collection of shaders and polygons

[Figure: fragment program model. Boxes: Input Registers, Fragment Program, Output Registers, Constants, Texture, Temp Registers, FB Memory. Legend: per thread, per Shader, per Context.]

Page 11

CUDA

“Compute Unified Device Architecture”

General purpose programming model: user kicks off batches of threads on the GPU

GPU = dedicated super-threaded, massively data parallel co-processor

Targeted software stack: compute oriented drivers, language, and tools

Driver for loading computational programs onto GPU

Page 12

NVIDIA Tesla C2090 Card Specs

512 GPU cores

1.30 GHz

Single precision floating point performance: 1331 GFLOPs (2 single precision flops per clock per core)

Double precision floating point performance: 665 GFLOPs (1 double precision flop per clock per core)

Internal RAM: 6 GB GDDR5

Internal RAM speed: 177 GB/sec (compared to roughly 30 GB/sec for regular RAM)

Has to be plugged into a PCIe slot (at most 8 GB/sec)
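The peak figures follow directly from the core count, clock, and flops per clock, and the PCIe limit can be put in perspective the same way. A small sketch, with hypothetical function names, that ignores latency and protocol overhead:

```cpp
#include <cassert>

// Peak rate in GFLOP/s = cores * clock (GHz) * flops per clock per core.
double peak_gflops(int cores, double clock_ghz, double flops_per_clock) {
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_clock > 0.0);
    return cores * clock_ghz * flops_per_clock;
}

// Seconds needed to move `gigabytes` over a link sustaining `gb_per_sec`,
// ignoring latency and protocol overhead (a simplifying assumption).
double transfer_seconds(double gigabytes, double gb_per_sec) {
    assert(gigabytes >= 0.0 && gb_per_sec > 0.0);
    return gigabytes / gb_per_sec;
}
```

`peak_gflops(512, 1.30, 2)` gives 1331.2 and `peak_gflops(512, 1.30, 1)` gives 665.6, matching the quoted numbers. Filling the 6 GB of card memory over PCIe at 8 GB/sec takes at least `transfer_seconds(6, 8)` = 0.75 s, whereas streaming the same 6 GB from on-card memory at 177 GB/sec takes about 0.034 s, which is why data movement across the bus deserves attention.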

Page 13

Why GPGPU Processing?

Calculation: TFLOPS vs. 150 GFLOPS

Memory Bandwidth: ~5-10x

Cost Benefit: GPU in every PC, massive volume

[Figure 1.1. Enlarging Performance Gap between GPUs and CPUs. Series: Multi-core CPU, Many-core GPU. Courtesy: John Owens]

Page 14

GPUs

The Good:

Performance: focused silicon use

High bandwidth for streaming applications

Similar power envelope to high-end CPUs

High volume, affordable

The Bad:

Programming: streaming languages (CUDA, OpenCL, etc.) require significant application intervention / development and are sensitive to hardware knowledge (memories, banking, resource management, etc.)

Not good at certain operations or applications: integer performance, irregular data, pointer logic, low compute intensity*

Questions about reliability / error: many have been addressed in the most recent hardware models
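The "low compute intensity" caveat can be made concrete with a simple bound: a kernel cannot sustain more than the smaller of the peak compute rate and (flops per byte moved) times (memory bandwidth). This bound is brought in here for illustration and is not stated on the slide; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on sustained GFLOP/s for a kernel that performs
// `flops_per_byte` floating point operations per byte moved to or from memory.
double attainable_gflops(double peak_gflops, double mem_gb_per_sec,
                         double flops_per_byte) {
    assert(peak_gflops > 0.0 && mem_gb_per_sec > 0.0 && flops_per_byte >= 0.0);
    return std::min(peak_gflops, mem_gb_per_sec * flops_per_byte);
}
```

As an example, single precision SAXPY performs 2 flops per element while moving 12 bytes (read `x[i]`, read `y[i]`, write `y[i]`). Using the card figures quoted earlier in the deck (1331 GFLOPs peak, 177 GB/sec), `attainable_gflops(1331.0, 177.0, 2.0 / 12.0)` is about 29.5, a small fraction of peak, so such a kernel is limited by memory bandwidth rather than by arithmetic.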

Page 15

Intel Many Integrated Core (MIC)

Knights Ferry

32 cores

Wide vector units

x86 ISA

Mostly a test platform at this point

Knights Corner will be the first real product (2012)

Page 16

FPGAs – Generated Accelerators

Configurable logic blocks

Interconnection mesh

Can be incorporated into cards or integrated inline.

Page 17

FPGAs

The Good:

Performance: good silicon use (do only what you need; maximize parallel ops/cycle)

Rapid growth: cells, speed, I/O

Power: 1/10th of CPUs

Flexible: tailor to application

The Bad:

Programming: VHDL, Verilog, etc. Advances have been made here to translate high level code (C, Fortran, etc.) to HW

Compile time: place and route for the FPGA layout can take multiple hours

FPGAs are typically clocked at about 1/10th to 1/5th the rate of an ASIC

Cost: they’re actually not cheap

Page 18

Accelerators in a System

External – entire application offloading

“Appliances” – DataPower, Azul

Attached – targeted offloading

PCIe cards – CUDA/FireStream GPUs, FPGA cards

Integrated – tighter connection

On-chip – AMD Fusion, Cell BE, network processing chips

Incorporated – CPU instructions

Vector instructions, FMA, crypto-acceleration

Page 19

Purdy Pictures

AMD “Fusion”

Nvidia M2090

IBM “CloudBurst”(DataPower)

Cray XK6 Integrated Hybrid Blade

Page 20

Accelerators in a System

External – entire application offloading

“Appliances” – DataPower, Azul

Attached – targeted offloading

PCIe cards – CUDA/FireStream GPUs, FPGA cards

Integrated – tighter connection

On-chip – AMD Fusion, Cell BE, network processing chips

Incorporated – CPU instructions

Vector instructions, FMA, some crypto-acceleration

Page 21

Programming Accelerators

C. Cascaval, et al., IBM Journal of R&D, 2010

Page 22

Programming Accelerators

Programming accelerators requires describing:

1. What portions of code will be run on the accelerator (as opposed to on the CPU)

2. How that code maps to the architecture of the accelerator, both compute elements and memories

The first is typically done on a function-by-function basis, i.e. a GPU kernel

The second is much more variable: parallel directives, SIMT block description, VHDL/Verilog…

Integrating these is not very mature at this point, but it is coming

Page 23

CUDA SAXPY

__global__ void
saxpy_cuda(int n, float a, float *x, float *y)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
…
int nblocks = (n + 255) / 256;

// invoke the kernel with 256 threads per block
saxpy_cuda<<<nblocks, 256>>>(n, 2.0, x, y);
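To see how the launch configuration maps threads onto array elements, the kernel's indexing can be emulated on the host. The sketch below is plain C++ (no GPU required); `saxpy_emulated` is a hypothetical name, and the two loops run sequentially here whereas the GPU runs the corresponding threads in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the launch: each (block, thread) pair computes the
// same global index the kernel would, and the bounds check discards the
// surplus threads in the last block.
void saxpy_emulated(int n, float a, const std::vector<float>& x,
                    std::vector<float>& y, int threads_per_block = 256) {
    assert(n >= 0 && threads_per_block > 0);
    assert(x.size() >= static_cast<std::size_t>(n));
    assert(y.size() >= static_cast<std::size_t>(n));

    // Round up so that nblocks * threads_per_block >= n.
    int nblocks = (n + threads_per_block - 1) / threads_per_block;

    for (int block = 0; block < nblocks; ++block) {                   // blockIdx.x
        for (int thread = 0; thread < threads_per_block; ++thread) {  // threadIdx.x
            int i = block * threads_per_block + thread;               // global index
            if (i < n)
                y[i] = a * x[i] + y[i];
        }
    }
}
```

For n = 1000 the rounding yields 4 blocks, or 1024 threads in total; the last 24 fail the `i < n` test and do nothing, which is exactly the role of the guard in the kernel.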

Page 24

Integrating Accelerators More Tightly

There are several efforts (mostly libraries and directive methods) to lower the entry point for accelerator programming

Library example: Thrust – STL-like interface for GPUs

thrust::device_vector<int> D(10, 1);
thrust::fill(D.begin(), D.begin() + 7, 9);
thrust::sequence(H.begin(), H.end());
…

Accelerator example: OpenACC – Like OpenMP

#pragma acc parallel [clauses] { structured block }

http://www.openacc-standard.org/
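For comparison with the CUDA kernel shown earlier, here is a sketch of SAXPY written in the directive style. The function name `saxpy_acc` and the choice of data clauses are illustrative assumptions; compilers without OpenACC support typically ignore the unknown pragma (possibly with a warning) and run the loop on the CPU.

```cpp
#include <cassert>

// SAXPY with an OpenACC directive. With an OpenACC-enabled compiler the loop
// may be offloaded to the accelerator; otherwise it runs as an ordinary loop.
void saxpy_acc(int n, float a, const float* x, float* y) {
    assert(n >= 0);
    // copyin: x is only read on the device; copy: y is read and written back.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The loop body is unchanged from the serial version: the directive says which region to offload and which arrays to move, while the mapping of iterations to hardware threads is left to the compiler.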

Page 25

Developing with Accelerators

1. Profile your code: what code is heavily used (and amenable to acceleration)?

2. Write accelerator kernels for heavily used code (Amdahl): replace the CPU version with accelerator offload

3. Play “chase the bottleneck” around the accelerator, AKA re-write the kernel a dozen times

4. Profit! Faster science/engineering/finance/whatever!

3. ???
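The "(Amdahl)" note in step 2 refers to Amdahl's law, which bounds the payoff from accelerating only part of a program. A minimal sketch; the function name and the example inputs are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction `p` of the runtime is
// accelerated by a factor `s` and the remaining (1 - p) is unchanged.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

If profiling shows that the offloaded code accounts for 90% of the runtime, a 25x kernel speedup yields `amdahl_speedup(0.9, 25)`, about 7.4x overall, and even a 100x kernel speedup yields only about 9.2x. This is why the profiling step comes first: the fraction of time spent in the accelerated code matters as much as how fast the kernel becomes.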

Page 26

Brandon’s stuff

A Story About Acceleration

Page 27

Big Picture

Architectures are moving towards “effective use of space” (or power).

Focusing architectures on a specific task (at the expense of others) can make for very efficient/effective tools (for that task)

HPC systems are beginning to integrate acceleration at numerous levels, but “PCIe card GPU” is the most common

Exploiting the most popular accelerators requires intervention by application programmers to map codes to the architecture.

Developing for accelerators can be challenging as significantly more hardware knowledge is needed to get good performance

There are major efforts at improving this

Page 28

Other Sessions of Interest

Tomorrow

2 – 3 pm: CUDA Programming Part I

3:30 – 5 pm: CUDA Programming Part II

WSCC 2A/2B

Tomorrow at 5:30pm

BOF: Broad-based Efforts to Expand Parallelism Preparedness in the Computing Workforce

WSCC 611/612 (here)

Wednesday at 10:30am

Panel/Discussion: Parallelism, the Cloud, and the Tools of the Future for the next generation of practitioners

WSCC 2A/2B
