
Page 1: Tesla GPU Computing

© NVIDIA Corporation 2009

Tesla GPU Computing: A Revolution in High Performance Computing

Mark Harris, NVIDIA

Page 2: Tesla GPU Computing

Agenda

- Tesla GPU Computing
  - What is GPU Computing?
  - Introduction to Tesla
- CUDA
  - CUDA Architecture
  - Programming & Memory Models
  - Programming Environment
- Fermi
  - Next Generation Architecture
- Getting Started
  - Resources

Page 3: Tesla GPU Computing

WHAT IS GPU COMPUTING?

Tesla GPU Computing

Page 4: Tesla GPU Computing

What is GPU Computing?

Computing with CPU + GPU: heterogeneous computing

[Diagram: a 4-core CPU alongside a many-core GPU]

Page 5: Tesla GPU Computing

Parallel Scaling

[Charts: peak performance (Gflop/s) and peak memory bandwidth (GB/s), 2003–2009. NVIDIA GPUs (G80, Tesla T10) pull steadily away from x86 CPUs (3 GHz Harpertown, 3 GHz Nehalem) on both metrics, approaching 1,000 Gflop/s and exceeding 100 GB/s by 2009.]

Page 6: Tesla GPU Computing

Low Latency or High Throughput?

CPU
- Optimized for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution

GPU
- Optimized for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation

Pages 7–9: Tesla GPU Computing

Processing Flow

1. Copy input data from CPU memory to GPU memory (across the PCI bus)
2. Load the GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory (across the PCI bus)
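
In CUDA C, these three steps map directly onto a handful of runtime API calls. A minimal, self-contained sketch; the scale kernel is illustrative (not from the slides) and error checking is omitted for brevity:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    // Illustrative kernel: double every element in place.
    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i)
            h_data[i] = (float)i;

        float *d_data;
        cudaMalloc((void **)&d_data, bytes);

        // 1. Copy input data from CPU memory to GPU memory (across the PCI bus).
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        // 2. Execute the GPU program; data stays resident on the board.
        scale<<<(n + 255) / 256, 256>>>(d_data, n);

        // 3. Copy results from GPU memory back to CPU memory.
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }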

Page 10: Tesla GPU Computing

INTRODUCTION TO TESLA

Tesla GPU Computing

Page 11: Tesla GPU Computing

Tesla™: High-Performance Computing
Quadro®: Design & Creation
GeForce®: Entertainment

Parallel Computing on All GPUs
100+ Million CUDA GPUs Deployed

Page 12: Tesla GPU Computing

Many-Core High Performance Computing

Each core has:
- Floating point & integer unit
- Logic unit
- Move, compare unit
- Branch unit

Cores managed by thread manager
- Spawns and manages over 30,000 threads
- Zero-overhead thread switching

10-series GPU has 240 cores
- 1.4 billion transistors
- 1 Teraflop of processing power

Page 13: Tesla GPU Computing

Tesla GPU Computing Products

                                  SuperMicro 1U       Tesla S1070        Tesla C1060        Tesla Personal
                                  GPU SuperServer     1U System          Computing Board    Supercomputer
    GPUs                          2 Tesla GPUs        4 Tesla GPUs       1 Tesla GPU        4 Tesla GPUs
    Single Precision Performance  1.87 Teraflops      4.14 Teraflops     933 Gigaflops      3.7 Teraflops
    Double Precision Performance  156 Gigaflops       346 Gigaflops      78 Gigaflops       312 Gigaflops
    Memory                        8 GB (4 GB/GPU)     16 GB (4 GB/GPU)   4 GB               16 GB (4 GB/GPU)

Page 14: Tesla GPU Computing

Reported Speedups

- Medical Imaging (U of Utah): 146X
- Molecular Dynamics (U of Illinois): 36X
- Video Transcoding (Elemental Tech): 18X
- Matlab Computing (AccelerEyes): 50X
- Astrophysics (RIKEN): 100X
- Financial simulation (Oxford): 149X
- Linear Algebra (Universidad Jaime): 47X
- 3D Ultrasound (Techniscan): 20X
- Quantum Chemistry (U of Illinois): 130X
- Gene Sequencing (U of Maryland): 30X

Page 15: Tesla GPU Computing

CUDA ARCHITECTURE

Page 16: Tesla GPU Computing

CUDA Parallel Computing Architecture

- Parallel computing architecture and programming model
- Includes a CUDA C compiler, plus support for OpenCL and DirectCompute
- Architected to natively support multiple computational interfaces (standard languages and APIs)

Page 17: Tesla GPU Computing

CUDA Parallel Computing Architecture

CUDA defines:
- Programming model
- Memory model
- Execution model

CUDA uses the GPU for general-purpose computing
- Facilitates heterogeneous computing: CPU + GPU

CUDA is scalable
- Scales to run on 100s of cores / 1000s of parallel threads

Page 18: Tesla GPU Computing

PROGRAMMING MODEL

CUDA

Page 19: Tesla GPU Computing

CUDA Kernels

The parallel portion of an application executes as a kernel
- The entire GPU executes the kernel, as many threads

CUDA threads:
- Lightweight
- Fast switching
- 1000s execute simultaneously

CPU (Host): executes functions
GPU (Device): executes kernels

Page 20: Tesla GPU Computing

CUDA Kernels: Parallel Threads

- A kernel is a function executed on the GPU as an array of threads, in parallel
- All threads execute the same code, but can take different paths
- Each thread has an ID, used to select input/output data and make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
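
Filling in the details around that fragment: threadID is typically derived from the built-in block and thread indices. A minimal sketch; func stands in for any per-element device function and is an assumption of this example:

    // Stand-in for "func" above: any per-element __device__ function.
    __device__ float func(float x)
    {
        return 2.0f * x + 1.0f;
    }

    __global__ void apply_func(const float *input, float *output, int n)
    {
        // Unique global thread ID: block offset plus position within the block.
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadID < n) {
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

Launching one thread per element, e.g. apply_func<<<(n + 255) / 256, 256>>>(d_in, d_out, n), covers the whole array.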

Pages 21–25: Tesla GPU Computing

CUDA Kernels: Subdivide into Blocks

- Threads are grouped into blocks
- Blocks are grouped into a grid
- A kernel is executed as a grid of blocks of threads

[Diagram: the grid of thread blocks mapped onto the GPU]

Page 26: Tesla GPU Computing

Communication Within a Block

Threads may need to cooperate:
- Memory accesses
- Share results

Threads cooperate using shared memory
- Accessible by all threads within a block

Restricting cooperation to within a block permits scalability
- Fast communication between N threads is not feasible when N is large
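
To make block-level cooperation concrete, here is a minimal sketch: each block of 256 threads sums its slice of the input in shared memory, with __syncthreads() ensuring every write is visible before it is read. The kernel and the fixed block size are assumptions of this example:

    __global__ void block_sum(const float *input, float *block_sums, int n)
    {
        __shared__ float cache[256];       // visible to every thread in this block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        cache[threadIdx.x] = (i < n) ? input[i] : 0.0f;
        __syncthreads();                   // wait until all values are written

        // Tree reduction within the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = cache[0];   // one partial sum per block
    }

Because no communication crosses block boundaries, the partial sums can be combined by a second kernel (or on the CPU), and the same code scales to any number of blocks.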

Page 27: Tesla GPU Computing

Transparent Scalability – G84

[Diagram: the same grid of 12 thread blocks scheduled onto a small GPU two blocks at a time, in six waves]

Because blocks can execute in any order on any available multiprocessor, the same grid runs unchanged on GPUs of different sizes.

Page 28: Tesla GPU Computing

Transparent Scalability – G80

[Diagram: the same 12 thread blocks scheduled eight at a time, then the remaining four]

Page 29: Tesla GPU Computing

Transparent Scalability – GT200

[Diagram: all 12 thread blocks running simultaneously, with spare capacity left idle]

Page 30: Tesla GPU Computing

CUDA Programming Model - Summary

- A kernel executes as a grid of thread blocks
- A block is a batch of threads that communicate through shared memory
- Each block has a block ID; each thread has a thread ID

[Diagram: the host launches Kernel 1 as a 1D grid of four blocks (0–3) and Kernel 2 as a 2D grid of 2×4 blocks ((0,0)–(1,3)) on the device]
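
In code, the IDs are the built-ins blockIdx and threadIdx, and the 1D/2D launch shapes are set with dim3. A minimal sketch; the kernel and buffer names are illustrative:

    __global__ void kernel2D(float *out, int width)
    {
        // Combine the 2D block ID and 2D thread ID into a unique (x, y).
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        out[y * width + x] = (float)(x + y);
    }

    // Host side: a 2D grid of 2 x 4 blocks, each with 16 x 16 threads.
    // d_out is assumed to hold (2*16) * (4*16) floats on the device.
    dim3 grid(2, 4);
    dim3 block(16, 16);
    kernel2D<<<grid, block>>>(d_out, 2 * 16);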

Page 31: Tesla GPU Computing

MEMORY MODEL

CUDA

Pages 32–37: Tesla GPU Computing

Memory hierarchy

- Thread: Registers
- Thread: Local memory
- Block of threads: Shared memory
- All blocks: Global memory
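
A minimal sketch of where each level appears in CUDA C; the kernel and variable names are illustrative:

    __global__ void memory_levels(const float *in, float *out)
    {
        // in/out point into global memory: off-chip, visible to all blocks.
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register

        float big[64];            // large per-thread arrays like this one may be
        big[i % 64] = in[i];      // placed in local memory (per-thread, off-chip)

        __shared__ float tile[256];        // shared memory: per-block, on-chip
        tile[threadIdx.x] = big[i % 64];
        __syncthreads();

        out[i] = tile[threadIdx.x];        // result written back to global memory
    }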

Page 38: Tesla GPU Computing

PROGRAMMING ENVIRONMENT

CUDA

Page 39: Tesla GPU Computing

CUDA APIs

The API allows the host to manage the devices:
- Allocate memory & transfer data
- Launch kernels

CUDA C “Runtime” API
- High level of abstraction - start here!

CUDA C “Driver” API
- More control, more verbose

OpenCL
- Similar to the CUDA C Driver API
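
To give a rough feel for the difference in verbosity, here is the same allocate-and-copy step in each API. A sketch only: error handling and the kernel launch are omitted, and h_data/bytes are assumed to be defined:

    // Runtime API: one call to allocate, one to copy.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Driver API: explicit initialization, device, and context management.
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dptr, bytes);
    cuMemcpyHtoD(dptr, h_data, bytes);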

Page 40: Tesla GPU Computing

CUDA C and OpenCL

Shared back-end compiler and optimization technology

- CUDA C: entry point for developers who prefer high-level C
- OpenCL: entry point for developers who want a low-level API

Page 41: Tesla GPU Computing

Visual Studio

Separate file types:
- .c/.cpp for host code
- .cu for device/mixed code

Compilation rules: cuda.rules

Syntax highlighting and IntelliSense

Integrated debugger and profiler: Nexus

Page 42: Tesla GPU Computing

NVIDIA Nexus IDE

The industry’s first IDE for massively parallel applications

- Accelerates co-processing (CPU + GPU) application development
- Complete Visual Studio-integrated development environment

Page 43: Tesla GPU Computing

NVIDIA Nexus IDE - Debugging

[Screenshot: Nexus GPU debugging session in Visual Studio]

Page 44: Tesla GPU Computing

NVIDIA Nexus IDE - Profiling

[Screenshot: Nexus profiler in Visual Studio]

Page 45: Tesla GPU Computing

Linux

Separate file types:
- .c/.cpp for host code
- .cu for device/mixed code

Typically makefile driven

cuda-gdb for debugging

CUDA Visual Profiler

Page 46: Tesla GPU Computing

NEXT GENERATION ARCHITECTURE

Fermi

Page 47: Tesla GPU Computing

Introducing the Fermi Architecture

- 3 billion transistors
- 512 cores
- DP performance 50% of SP
- ECC
- L1 and L2 caches
- GDDR5 memory
- Up to 1 Terabyte of GPU memory
- Concurrent kernels, C++

Page 48: Tesla GPU Computing

Fermi SM Architecture

- 32 CUDA cores per SM (512 total)
- Double precision at 50% of single precision performance (8x over GT200)
- Dual thread scheduler
- 64 KB of RAM for shared memory and L1 cache (configurable)

Page 49: Tesla GPU Computing

CUDA Core Architecture

- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Fused multiply-add (FMA) instruction for both single and double precision
- Newly designed integer ALU optimized for 64-bit and extended-precision operations

Page 50: Tesla GPU Computing

Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

- L1 cache per SM (per 32 cores): improves bandwidth and reduces latency
- Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU

[Diagram: Parallel DataCache™ memory hierarchy]

Page 51: Tesla GPU Computing

Larger, Faster Memory Interface

- GDDR5 memory interface: 2x the speed of GDDR3
- Up to 1 Terabyte of memory attached to the GPU: operate on large data sets

Page 52: Tesla GPU Computing

ECC

ECC protection for DRAM
- ECC supported for GDDR5 memory

All major internal memories protected
- Register file, shared memory, L1 cache, L2 cache

Detects 2-bit errors, corrects 1-bit errors (per word)

Page 53: Tesla GPU Computing

GigaThread™ Hardware Thread Scheduler

- Hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching
- Concurrent kernel execution

Page 54: Tesla GPU Computing

GigaThread™ Hardware Thread Scheduler

Concurrent Kernel Execution + Faster Context Switch

[Diagram: serial kernel execution vs. parallel (concurrent) kernel execution]
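
Concurrency is expressed with CUDA streams: kernels launched into different streams carry no ordering constraint between them, so the scheduler may run them side by side. A minimal sketch; the kernels and their arguments are illustrative:

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in independent streams: eligible to run concurrently.
    kernelA<<<gridA, blockA, 0, s1>>>(d_a);
    kernelB<<<gridB, blockB, 0, s2>>>(d_b);

    cudaDeviceSynchronize();   // wait for both before using their results
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);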

Page 55: Tesla GPU Computing

GigaThread Streaming Data Transfer Engine

Dual DMA engines
- Simultaneous CPU→GPU and GPU→CPU data transfer
- Fully overlapped with CPU and GPU processing time

[Diagram: activity snapshot showing transfers overlapped with CPU and GPU work]
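
Overlap requires page-locked host memory (from cudaHostAlloc) and asynchronous copies issued into streams; with dual DMA engines, an upload in one stream can proceed while a download in another is in flight. A minimal two-chunk sketch; the buffers and the process kernel are illustrative:

    // h_in/h_out: page-locked host buffers; d_buf[0], d_buf[1]: device buffers
    // of cbytes each, where cbytes = chunk * sizeof(float).
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < 2; ++c) {
        // Within one stream, copy -> kernel -> copy stay ordered; across the
        // two streams, uploads and downloads can overlap on the DMA engines.
        cudaMemcpyAsync(d_buf[c], h_in + c * chunk, cbytes,
                        cudaMemcpyHostToDevice, s[c]);
        process<<<grid, block, 0, s[c]>>>(d_buf[c]);
        cudaMemcpyAsync(h_out + c * chunk, d_buf[c], cbytes,
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();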

Page 56: Tesla GPU Computing

Enhanced Software Support

Full C++ support
- Virtual functions
- Try/catch hardware support

System call support
- Support for pipes, semaphores, printf, etc.

Unified 64-bit memory addressing

Page 57: Tesla GPU Computing

Tesla GPU Computing Products: Fermi

                                  Tesla S2050 / S2070     Tesla C2050 / C2070
                                  1U Systems              Computing Boards
    GPUs                          4 Tesla GPUs            1 Tesla GPU
    Double Precision Performance  2.1 – 2.5 Teraflops     520 – 630 Gigaflops
    Memory                        12 GB (3 GB/GPU) /      3 GB / 6 GB
                                  24 GB (6 GB/GPU)

Page 58: Tesla GPU Computing

RESOURCES

Getting Started

Page 59: Tesla GPU Computing

Getting Started

CUDA Zone: www.nvidia.com/cuda
- Introductory tutorials/webinars
- Forums

Documentation
- Programming Guide
- Best Practices Guide

Examples
- CUDA SDK

Page 60: Tesla GPU Computing

Libraries

NVIDIA
- CUBLAS: dense linear algebra (subset of the full BLAS suite)
- CUFFT: 1D/2D/3D real and complex FFTs

Third party
- NAG: numeric libraries, e.g. RNGs
- CULAPACK/MAGMA

Open source
- Thrust: STL/Boost-style template library
- CUDPP: data-parallel primitives (e.g. scan, sort and reduction)
- CUSP: sparse linear algebra and graph computation

Many more...
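
As a taste of the library route, a minimal Thrust sketch: a full GPU sort with no hand-written kernel. It assumes only the Thrust headers that ship with the CUDA toolkit:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main(void)
    {
        // Random data on the host.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = rand();

        thrust::device_vector<int> d = h;              // copy to the GPU
        thrust::sort(d.begin(), d.end());              // parallel sort on the device
        thrust::copy(d.begin(), d.end(), h.begin());   // results back to the host
        return 0;
    }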

Page 61: Tesla GPU Computing

Tesla GPU Computing

Questions?